COLLEGE PRESS AND STUDENT ACHIEVEMENT¹


DONALD L. THISTLETHWAITE 
National Merit Scholarship Corporation, Evanston, Illinois 


The National Merit Scholarship program 
offers an unusual opportunity for studying 
the development of intellectual talent. The 
number of high school students participat- 
ing in this annual nationwide talent search 
increased from 58,000 in 1956 to almost
half a million in 1959. It is estimated that
well over half of all the high school seniors 
in the top 30% of the population as re- 
gards intellectual ability participated in the 
1959 program. One of the aims of this study 
was to see what could be learned from these 
students about the kinds of environments 
conducive to the realization of their po- 
tentialities. 

The analysis related the environmental 
press at different colleges, as measured by 
the College Characteristics Index (Pace & 
Stern, 1958), to measures of student 
achievement. The criterion of achievement 
was the percentage of the college’s alumni 
who later earned doctorates. In the educa- 
tional literature, this criterion has fre- 
quently been interpreted as a measure of 
the college’s productivity (Knapp & Good- 
rich, 1952; Knapp & Greenbaum, 1953). 


1 Based upon a paper presented at meet- 
ings of the American Personnel and Guid- 
ance Association, 1959. This research was 
partially supported by the National Science 
Foundation and the Old Dominion Founda- 
tion. The author is indebted to John L. 
Holland and Laura Kent for their editorial 
assistance and to Lindsey Harmon for mak- 
ing available prepublication tabulations of 
baccalaureate origins of doctorates awarded 
in the United States. 


Several investigators (Stuit, Helmstad- 
ter, & Fredericksen, 1956; Holland, 1957) 
point to one difficulty in interpreting the 
Knapp-Goodrich and the Knapp-Green- 
baum results: since no adjustments were 
made for the fact that some colleges get a 
higher proportion of talented students than 
others, we do not know whether the alumni 
of colleges rated high on these indexes ex- 
hibit more achievement because of their 
undergraduate training or because of their 
initial superiority in aptitude. The present 
study attempts to resolve this ambiguity 
by making adjustments in order to equate 
colleges with respect to student quality. 


DEVELOPMENT OF CRITERIA 
FOR ACHIEVEMENT 


Because of the reluctance of many col- 
leges to release data on the aptitude of 
their entering classes, it has been difficult 
to make the required adjustments. Our ap- 
proach was to develop an approximate in- 
dex of student quality and to validate it 
for those colleges on which we could obtain 
aptitude data. If the index is valid for these 
colleges it may be used to describe other 
colleges as well. The first two National 
Merit programs provided records of the 
college enrollments of over 9,600 talented 
students. For each of 511 colleges we calculated
the percentage of the freshman class
who were Merit Scholars or Certificate of
Merit winners in the Merit program during
the preceding spring. This percentage for
1956 freshman enrollments is referred to as
the 1956 Talent Supply Index. For 39 
men’s colleges, the correlation between this 
index and the mean Scholastic Aptitude 
Test score of entering 1956 classes was .74, 
and for 43 women’s colleges the correlation 
was .76. These estimates of validity are 
probably conservative, since these colleges 
were all College Board colleges and so 
tended to enroll superior students. The 
Talent Supply Index probably has a 
validity of at least .80 for the wider range 
of talent. To increase the reliability of this 
measure of student quality, the two talent 
supply indexes for the years 1956 and 1957 
were added.²

If it is assumed that the calibre of the 
college’s student body remains relatively 
constant over a period of years, we may use 
our index to estimate student quality for 
the period during which Knapp’s alumni 
groups were in college. By correlating the 
composite Talent Supply Index with the 
Knapp-Goodrich and Knapp-Greenbaum 
productivity indexes, it is possible to esti- 
mate the magnitude of the error introduced 
by ignoring diversities in student quality. 
The Knapp-Goodrich index of science pro- 
ductivity correlates .38 with the Talent 
Supply Index, a figure very close to the 
correlation of .39 reported by these authors 
between their index and ACE aptitude 
scores for a small sample of 50 colleges. The 
Knapp-Greenbaum indexes correlate .71 
and .64 with our Talent Supply Index. 
Thus, variations in student quality appear 
to account for 40 to 50% of the variance 
in the Knapp-Greenbaum indexes, and for 
approximately 15% of the variance in the 
Knapp-Goodrich measure. It seems imperative
to control diversities in student quality
if we intend to use college productivity
rates as measures of educational effectiveness.

2 The correlation between the two indexes
for the 511 colleges was .92, indicating a fair
degree of stability in talent supply. The
correlation between mean ACE scores of
entering freshmen at 19 Minnesota institutions
of higher learning, over the seven-year
period 1947-1954, was .80 (data from personal
communication, W. R. Layton, University
of Minnesota).
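
The percentages of variance quoted in the text follow from squaring each reported correlation, since a squared correlation gives the proportion of variance in one measure that is linearly associated with the other. A brief check in Python, in which the function name and the dictionary labels are illustrative only:

```python
def variance_explained(r):
    """Proportion of variance shared by two measures that correlate r."""
    return r ** 2

# Correlations with the Talent Supply Index, as reported in the text.
# The "first"/"second" labels are ours; the text does not name the two
# Knapp-Greenbaum indexes individually.
reported = {
    "Knapp-Goodrich (science)": 0.38,
    "Knapp-Greenbaum (first index)": 0.71,
    "Knapp-Greenbaum (second index)": 0.64,
}

shares = {name: variance_explained(r) for name, r in reported.items()}
for name, share in shares.items():
    print(f"{name}: r^2 = {share:.2f}")
```

The squared values come to about .14, .50, and .41, which matches the "approximately 15%" and "40 to 50%" figures given for the two sets of indexes.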

By assuming stability in college talent 
supplies over the years, we may partial out 
the effects of diversities in student quality. 
Our productivity index for the natural 
sciences is defined as the discrepancy be- 
tween a college’s expected rate of producing 
natural science Ph.D.s (as predicted from 
its enrollment of talented students) and its 
actual rate of productivity. Given the cor- 
relation table relating the college’s talent 
supply (X) and the percentage of the 
college’s graduates who earn doctorates 
(Y), the residuals in predicting Y from X 
give an index of productivity independent 
of student quality. Productivity indexes 
were then computed for the 511 colleges; 
these colleges enrolled about 70% of all 
freshmen entering degree-granting colleges 
in 1956 and 1957. 
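
The residual index described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: it uses a straight-line least-squares prediction of Y from X (the report does not say how the prediction was fitted from the correlation table), it takes the residual as actual minus expected so that positive values mean more doctorates than talent supply would predict, and the five data points are hypothetical, not values from the study.

```python
from statistics import mean

def pearson_r(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def residual_productivity(talent_supply, doctorate_rate):
    """Residuals of Y (doctorate rate) after linear prediction from X (talent supply).

    Assumption: ordinary least squares on a straight line; residual = actual - expected.
    """
    x_bar, y_bar = mean(talent_supply), mean(doctorate_rate)
    sxx = sum((x - x_bar) ** 2 for x in talent_supply)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(talent_supply, doctorate_rate))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x)
            for x, y in zip(talent_supply, doctorate_rate)]

# Hypothetical values for five colleges (not data from the study).
talent_supply = [0.5, 1.0, 2.0, 4.0, 8.0]    # X: composite Talent Supply Index
doctorate_rate = [1.0, 1.5, 4.0, 3.5, 9.0]   # Y: percent of graduates earning doctorates

index = residual_productivity(talent_supply, doctorate_rate)
print([round(v, 2) for v in index])
print(round(pearson_r(talent_supply, index), 6))
```

By construction the residuals are uncorrelated with X, which is the sense in which such an index is "independent of student quality"; any nonlinear dependence on talent supply would remain in the residuals under this linear assumption.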

Since this analysis suggested the need to 
treat scholarly and scientific fields sep- 
arately, we have developed two produc- 
tivity measures—one for the natural 
sciences and one for the arts, humanities, 
and social sciences. For convenience, the 
former is referred to as the NS, and the 
latter as the AHSS, index. The records of 
doctorates granted which we used were 
those published by the National Academy 
of Sciences, National Research Council, for 
the period 1950-56. A more complete de- 
scription of sources of information used in 
calculating the productivity indexes is given 
in a previous report (Thistlethwaite, 1959). 
Since the median lapse between the bac- 
calaureate and the science doctoral degrees 
is about seven years, we are dealing with 
alumni who graduated from college during 
the period 1943-49. Thus we are extra- 
polating backwards over a 10- to 15-year 
period. 

Because of these extrapolations, the indexes
lack sufficient reliability to permit
comparisons of individual colleges. How- 
ever, several lines of evidence suggest that 
the indexes are sufficiently valid for group 
comparisons. First, according to students’
reports, there is a clear difference between 
productive and unproductive colleges in 
their emphasis upon preparing for graduate 
study. Second, the measures seem to be 
sensitive to differences in college objectives. 
Professional and technical schools rank 
highest in NS productivity and lowest in 
AHSS productivity. Similarly, of students 
attending colleges ranking high in AHSS 
productivity, 96% report that the college 
“library is exceptionally well equipped with 
journals, periodicals, and books in the 
social sciences.” Of students attending 
colleges ranking low in AHSS productivity, 
only about 60% endorse this statement. 
Third, colleges located in the South tend 
to rank low on both measures, a finding 
which is consistent with the Knapp-Green- 
baum results. 


MEASURES OF COLLEGE PRESS


To identify student cultures and faculty 
characteristics which motivate students to 
seek the doctorate, student ratings of col- 
leges varying in productivity were com- 
pared. In this analysis, it is assumed that 
the environmental demands or pressures at 
the colleges in our sample have not varied 
greatly during the past 10-15 years. With 
this premise, it is possible to use current 
appraisals of the college environment as an 
estimate of the college atmosphere during 
the period in which our alumni groups were 
in college. 

The College Characteristics Index (Pace
& Stern, 1958) was administered to 916 of
our National Merit Scholars and Certificate
of Merit winners at 36 colleges. These
students, who were sophomores at the time
of the survey, were asked to judge whether
each of the 300 statements in the CCI was
probably true or probably false about their
college. The average number of observers
per college was only 25, and these observer
groups can hardly be thought of as representative
of the entire student body, since
they come from an exceptionally talented
group. On the other hand, if a dominant
press really exists in a particular college,
almost any group of students attending 
that college will probably recognize it. 
Fortunately, we had a diverse group of 
colleges. The 36 colleges included: Amherst, 
Brown (including Pembroke), California 
Institute of Technology, Carleton, Carne- 
gie, U. of Chicago, U. of Colorado, Cornell, 
Dartmouth, Duke, Georgia Institute of 
Technology, Harvard, Indiana, Iowa State, 
U. of Kansas, Massachusetts Institute of 
Technology, U. of Michigan, U. of Min- 
nesota, Northwestern, U. of Notre Dame, 
Oberlin, U. of Pennsylvania, Pomona, 
Princeton, Purdue, Radcliffe, Rensselaer 
Polytechnic Institute, Rice, Smith, Stan- 
ford, Swarthmore, U. of Texas, U. of 
Wisconsin, Wellesley, Wesleyan, and Yale. 


RESULTS


The student reports provide abundant 
evidence that college press differ con- 
siderably. Equally important, the press are 
consistent with our expectations. For ex- 
ample, Harvard and Radcliffe had the 
highest median scores on Humanism; MIT 
the highest on Scientism; Georgia Tech. 
and Rensselaer the highest on Pragmatism; 
Smith College the highest on Nurturance; 
and the University of Chicago the highest 
on Understanding. Clearly the CCI reflects 
differences in college atmospheres which are 
consistent with common belief. 

The correlations between student
achievement and some of the variables
descriptive of college environments were
remarkably high in view of the assumptions
of the study. The college’s median score on
each CCI scale was correlated with each 
of the productivity indexes. Table 1 sum- 
marizes the results for all scales which 
exhibited a correlation significant at the .01 
level with at least one of the productivity 
measures. Note that 12 of the 30 scales 
meet this criterion. The most discriminating
item in each of these 12 scales is shown
in Table 1, together with the response 
which was weighted positively. The most 
striking feature of these results is that one
type of college environment is associated
with achievement in the natural sciences,
while a different kind of environment is
related to accomplishment in the arts,
humanities, and social sciences. Productivity
in the humanities is positively related
to Humanism, Reflectiveness, Sentience,
Harmavoidance, and Understanding. It is
negatively related to Pragmatism, Deference,
and Abasement. Productivity in
the natural sciences is positively related to
Scientism, Aggression, and Impulsion, and
negatively related to Order, Deference, and
Sentience.


TABLE 1
CORRELATIONS BETWEEN COLLEGE CHARACTERISTICS INDEX SCALES
AND PRODUCTIVITY AT 36 COLLEGES

CCI Scale | Most Discriminating Item on Scale | Correlation With NS | Correlation With AHSS
Humanism | Few students are planning postgraduate work in the social sciences. (F) | -.23 | [illegible]
Pragmatism | Students are more interested in specialization than in liberal education. (T) | .15 | -.77**
Reflectiveness | Modern art and music get little attention here. (F) | -.20 | .69**
Sentience | Student rooms are more likely to be decorated with pennants and pin-ups than with paintings, carvings, mobiles, fabrics, etc. (F) | -.35* | .68**
Harmavoidance | Fire drills are held in student dormitories and residences. (T) | -.23 | [illegible]
Deference | Religious worship here stresses service to God and obedience to His laws. (T) | -.38* | -.50**
Abasement | There is a lot of apple-polishing around here. (T) | -.11 | -.47**
Understanding | There is a lot of emphasis on preparing for graduate work. (T) | .18 | [illegible]
Scientism | Few students are planning careers in science. (F) | .59** | -.03
Aggression-Blameavoidance | Students ask permission before deviating from common policies or practices. (F) | .56** | -.24
Impulsion-Deliberation | Students frequently do things on the spur of the moment. (T) | [illegible] | [illegible]
Order | Professors usually take attendance in class. (T) | -.43** | -.18

* P = .05.
** P = .01.
Entries marked [illegible] could not be read in the source.

Although there are obvious differences 
between the CCI scales which predict 


achievement in the two broad fields, the 
implications for college teachers and ad- 
ministrators are not clear from this anal- 
ysis. Part of the ambiguity arises from the 
fact that most of the CCI scales are com- 
posites of items descriptive of students and 
faculty. In Table 1, for example, the item
taken from the Scientism scale, “Few
students are planning careers in science,”
pertains to student behavior, while the
item quoted from the scale called Order,
“Professors usually take attendance in
class,” describes faculty behavior. The
“composite” nature of these scales is well
illustrated by Humanism: it has five items
describing student behavior and values and
five items describing the faculty or administration.
From these correlations we
cannot tell whether faculty influences or 
student culture influences, or both, are re- 
lated to productivity. 

Therefore, another analysis was made 
based upon revised scales containing more 
homogeneous items. That is, a set of
student press scales was devised from
items in the CCI so that each scale in- 
cluded only those items descriptive of 
student values, interests, or behaviors. 
Similarly, we constructed a group of faculty 
press scales which included only those items 
descriptive of the college faculty or ad- 
ministration. Items which seemed to de- 
scribe the same trait and which exhibited 
the same pattern of correlations with the 
two productivity measures were grouped 
together. Since items which showed no relation
to the achievement criteria were
discarded, the correlations to be reported
should be interpreted with caution. In other
words, these revised scales need to be cross-validated
on a new sample of colleges.

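
The reason for this caution can be illustrated with a small simulation. The sketch below is not the author's procedure: it assumes purely random "items" and a random criterion for 36 hypothetical colleges, keeps the items that happen to correlate most strongly with the criterion, and compares the composite's correlation in the selection sample with its correlation in a fresh sample. All names and parameter values are illustrative.

```python
import random
from statistics import mean

def pearson_r(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def simulate_once(rng, n_colleges=36, n_items=60, n_keep=10):
    """Select items on one noise sample, then re-test the composite on another."""
    def draw_sample():
        criterion = [rng.gauss(0, 1) for _ in range(n_colleges)]
        items = [[rng.gauss(0, 1) for _ in range(n_colleges)]
                 for _ in range(n_items)]
        return criterion, items

    def composite(items, chosen):
        # Sum the chosen items, each weighted by the sign of its original correlation.
        return [sum(sign * items[j][i] for j, sign in chosen)
                for i in range(n_colleges)]

    criterion, items = draw_sample()
    rs = [pearson_r(item, criterion) for item in items]
    keep = sorted(range(n_items), key=lambda j: abs(rs[j]), reverse=True)[:n_keep]
    chosen = [(j, 1 if rs[j] >= 0 else -1) for j in keep]
    r_selection = pearson_r(composite(items, chosen), criterion)

    # Fresh sample of colleges: same chosen items and signs, new noise.
    new_criterion, new_items = draw_sample()
    r_fresh = pearson_r(composite(new_items, chosen), new_criterion)
    return r_selection, r_fresh

rng = random.Random(0)
results = [simulate_once(rng) for _ in range(100)]
mean_selection_r = mean(r for r, _ in results)
mean_fresh_r = mean(r for _, r in results)
print(f"mean r in selection sample: {mean_selection_r:.2f}")
print(f"mean r in fresh sample:     {mean_fresh_r:.2f}")
```

Even though no item has any true relation to the criterion, the selected composite correlates substantially with it in the sample used for selection and near zero in the new sample. Real scales built this way would be expected to shrink less than pure noise does, but by an unknown amount, which is why cross-validation is needed.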
The correlations for the revised student 
press scales are given in Table 2. These 
correlations show the student attitudes, 
interests, and peer group norms which are 
related to achievement. In general, this 
analysis is consistent with the hypothesis 
that scientific and scholarly achievers 
thrive in different types of environments. 
Student cultures characterized by Humanism,
Breadth of Interests, and Reflectiveness
are associated with scholarly
productivity, whereas cultures characterized
by Participation and Aggression are
negatively related to scholarly productivity.
Motivation to seek the Ph.D. in the
natural sciences, on the other hand, seems
to be stimulated by student cultures which
are high in Scientism and Aggression and
inhibited by those which stress Social Conformity.
It could be, of course, that initial
differences in student attitudes are partly
reflected in these correlations. However, it
seems reasonable to assume that different
student cultures have considerable effects
upon student achievement by virtue of the
kinds of behavior they sanction.


TABLE 2
CORRELATIONS BETWEEN REVISED STUDENT PRESS SCALES
AND PRODUCTIVITY AT 36 COLLEGES

Student Press Scale | Most Representative Item (Highest item-total score correlation)
Humanism | There is a lot of interest here in poetry, music, painting, sculpture, architecture, etc. (T)
Breadth of interests | Most students have very little interest in round tables, panel meetings, or other formal discussions. (F)
Reflectiveness | There would be a capacity audience for a lecture by an outstanding philosopher or theologian. (T)
Participation | Student pep rallies, parades, dances, carnivals, or demonstrations occur very rarely. (F)
Aggression | Hazing, teasing, and practical joking are fairly common. (T)
Scientism | When students get together they seldom talk about science. (F)
Social conformity | Students think about dressing appropriately and interestingly for different occasions: classes, social events, sports, and other affairs. (T)

Correlations with productivity in NS and AHSS: most entries are illegible in the source, and the surviving entries cannot be assigned to rows or columns with certainty. In order of appearance they are: -.07; -.27 and .67**; .43** and -.54**; -.42* and .03.

* P = .05.
** P = .01.

An equally important part of the college 
press consists of faculty practices and ad- 
ministrative policy. Table 3 shows some of 
the effects of these stimuli upon student 
achievement. The analysis confirms once 
again the view that one type of college press 
stimulates achievement in the natural 
sciences, while a different type facilitates 
achievement in the arts, humanities, and 
social sciences. Colleges outstandingly successful
in encouraging undergraduates to
get the doctorate in humanistic fields are
characterized by (a) excellent social science
faculty and resources, (b) a flexible, or
somewhat unstructured, curriculum, (c)
energy and controversiality of instruction,
and (d) informality and warmth of student-faculty
contacts. At colleges high in natural
science productivity, too, the faculty tends 
to have contacts with students character- 
ized by informality and warmth, but here 
the similarity ends. The latter are char- 
acterized by the absence of outstanding 
social science faculties or resources. The 
teachers tend to be nondirective in their 
teaching methods: for example, students 
find it relatively hard to predict examina- 
tion questions and to take clear notes in 
class; instructors less frequently outline 
explicit goals and purposes for courses; and 
students are not required to submit out- 
lines before writing term papers and re- 
ports. Finally, the Closeness of Supervision 
scale suggests that the faculty does not play 
the role of Big Brother: students need not 
sit in assigned seats and attendance is not 
taken; student organizations are not closely 
supervised to guard against mistakes; 
faculty members are tolerant and under- 
standing in dealing with violations of rules. 

The scale called Informality and Warmth
of Student-Faculty Contacts is of special
interest since it seems to predict achievement
in both areas. The most representative
items of this scale tell us something
about the behavior of the teacher who
stimulates graduate study: he does not see
students only during office hours or by appointment;
open displays of emotion are
not likely to embarrass him; students need
not wait to be called upon before speaking
in class; in talking with students he frequently
refers to his colleagues by their
first names; students do not feel obliged to
address him as “professor” or “doctor.” In
other words, the stimulating teacher is considerate
and does not encourage deference
or abasement in his students.


TABLE 3
CORRELATIONS BETWEEN REVISED PRESS SCALES
AND PRODUCTIVITY AT 36 COLLEGES

Faculty Press Scale | Most Representative Item (Highest item-total score correlation) | Correlation With Productivity In NS | Correlation With Productivity In AHSS
Excellence of social science faculty and resources | Course offerings and faculty in the social sciences are outstanding. (T) | -.42* | .83**
Flexibility of curriculum | If a student fails a course he can usually substitute another one for it rather than take it over. (T) | -.31 | .68**
Energy and controversiality of instruction | Class discussions are typically vigorous and intense. (T) | -.13 | .58**
Informality and warmth of student-faculty contacts | Faculty members and administrators see students only during scheduled office hours or by appointment. (F) | .43** | .40*
Closeness of supervision | Professors usually take attendance in class. (T) | -.38* | -.22
Directiveness of teaching methods | Instructors clearly explain the goals and purposes of their courses. (T) | -.42* | .24

* P = .05.
** P = .01.

An additional analysis relates college 
press to hours of study. Though the number
of hours the student spends in study is
not a measure of achievement, one would
generally expect to find more study in 
stimulating college environments. As a part 
of a related project we asked some 1900 
Certificate of Merit winners attending 35 
of the 36 colleges previously mentioned to 
estimate the average number of hours per 
week they spent in study outside the class- 
room. An analysis of variance indicates that 
there were highly significant differences in 
mean hours of study at these 35 colleges. 
The correlations between the press scales 
and the mean hours of study at each of 
these colleges are shown in Table 4. 

In general, the average number of hours 
of study of the college’s students was re- 
lated to AHSS productivity but unrelated 
to NS productivity. The correlations were 
+.48 and —.01 respectively. It seems that 
systematic study habits outside the classroom
(and perhaps also outside the laboratory)
make for achievement in the
humanities but not in the natural sciences.
Perhaps the two broad fields require different
types of study: the promising young
scientist needs his laboratory to make discoveries,
while the scholar depends more
upon library resources. In any case, there
is a striking similarity between the college
press correlates of the AHSS and the hours
of study measures. Students study more
outside the classroom, and are stimulated
to obtain doctorates in the humanities and
social sciences, primarily when the dominant
student culture is characterized by
Reflectiveness, Humanism, Breadth of Interests,
and by relatively little Participation.
Similarly, faculty press which stimulate
graduate study in the humanities and
social sciences also tend to encourage more
study outside the classroom. Emphasis
upon high academic standards is significantly
related to hours of study, as it should
be, although this press did not differentiate
between colleges which are high and low in
productivity.


TABLE 4
CORRELATIONS BETWEEN REVISED PRESS
SCALES AND HOURS OF STUDY

College Press | Correlation With Mean Hours of Study

I. Student Culture
Reflectiveness | [illegible]
Humanism | [illegible]
Breadth of Interests | [illegible]
Participation | [illegible]
Aggression | .24
Scientism | .20
Social Conformity | .14

II. Faculty Press
Flexibility of Curriculum | [illegible]
Energy and Controversiality of Instruction | [illegible]
Emphasis Upon High Academic Standards | [illegible]
Excellence of Social Science Faculty and Resources | [illegible]
Informality and Warmth of Student-Faculty Contacts | [illegible]
Closeness of Supervision | [illegible]
Directiveness of Teaching Methods | [illegible]

Two further entries, .43** and .42*, survive in the source, but the rows to which they belong cannot be determined.

* P = .05.


DISCUSSION


College press conducive to intellectual 
achievement are concepts of great theo- 
retical importance, since they help to 
organize research on higher education. Such
conceptualizations might suggest methods
for improving higher education. The anal- 
ysis indicates that college faculties and 
student cultures play important roles in 
motivating undergraduates to seek ad- 
vanced training. The college press which 
encourage the scientist differ from those 
which inspire the scholar. There are, to be 
sure, some college characteristics—par- 
ticularly informal and friendly contacts 
between faculty and students—associated 
with achievement in both the natural 
sciences and humanities, but the differences 
are more striking than the similarities. 
Particularly noteworthy are those press 
scales which correlate positively with one 
productivity index and negatively with the 
other. 

Some of these correlations indicate a 
specialization of function in American col- 
leges. For example, the CCI scales called 
Humanism and Scientism, the revised 
student press scales having the same desig- 
nations, and the rating of the excellence of 
social science faculty and resources un- 
doubtedly reflect diversities in educational 
objectives. In addition, it is possible to 
discern an ethos of the college high in 
natural science productivity distinct from 
that of the college excelling in the produc- 
tion of potential doctorates in the humani- 
ties. The environment productive of natural 
scientists is characterized by student ag- 
gression, nonconformity, and commitment 
to science; the faculty tends to be nondirec- 
tive in teaching methods though adhering 
to strict curricular requirements. One 
student at a technological institute captures 
some of this quality in his comment: “The
school tries very hard to flunk you out.” 
On the other hand, colleges high in AHSS 
productivity are characterized more fre- 
quently by policies which challenge the 
student without threatening him—this 
orientation is suggested by the CCI state- 
ment, “If a student fails a course he can 
usually substitute another one for it rather 
than take it over.” Similarly, energy and 
enthusiasm seem more typical of teachers
in these schools, a characteristic which 
other investigators have found to be related 
to student evaluations of teachers (French, 
1958). Students at colleges high in AHSS 
productivity, unlike their NS counterparts, 
are characterized by breadth of interests, 
greater reflectiveness, limited participation 
in campus antics, and limited expression of 
aggression toward faculty and fellow stu- 
dents. 

The present analysis clearly needs to be 
extended in many directions before we can 
explain why some educational environ- 
ments are more effective than others. Es- 
pecially promising from the standpoint of 
establishing causal relations are longi- 
tudinal studies which follow students 
matched for initial aptitude and career 
motivations through the undergraduate 
and postcollege years. Such studies are now 
being initiated among intellectually tal- 
ented students in the National Merit 
Scholarship program. 


SUMMARY 


The present report suggests that the 
college environment is an important de- 
terminant of the student’s motivation to 
seek advanced intellectual training. More- 
over, the student cultures and faculty press 
which stimulate achievement in the natural 
sciences appear to differ from those which 
stimulate achievement in the arts, humani- 
ties, and social sciences. New dimensions 
for assessing student cultures and faculty 
behaviors are described, and it is shown 
that these dimensions have promise as 
predictors of student achievement and of 
the amount of study of the student outside
the classroom. 


ABNORMAL PSYCHOLOGY AS A SELECTIVE FACTOR:
A CONFIRMATION AND EXTENSION 


LEON M. WISE 
Heidelberg College 


There have often been allusions made by 
college students and others to the effect 
that students who register for a course in 
abnormal psychology are, or tend to be, 
more abnormal, neurotic, or poorly adjusted 
than one might expect to find by chance 
in the average classroom. One study (Mills, 
1955), which has already shed some light 
on the problem, compared students in an 
abnormal psychology course with students 
in a European history course. The general 
findings indicated a statistically significant 
difference between the two courses using 
the Munroe inspection technique. The
author implied that psychology acts selectively
on students in that, frequently,
poorly adjusted students are attracted to
psychology courses.

It came to the present author’s attention 
that it might not be psychology, per se, 
which acts as a selective factor, but specific 
course content. That is, if the course con- 
tent deals mainly with the bizarre aspects 
of human behavior, this would tend to 
attract students in emotional conflict more 
readily than course content dealing with the 
more unemotional aspects of human be- 
havior. For example, it was felt that if 
psychology courses were to be compared 
with respect to morbidity of content and 
the emotional stability of their respective 
students, a relationship would be found to 
exist. In courses which might be termed 
nonpsychological in nature, emotional con- 
flict of students should be minimal, while 
in psychology courses, and especially ab- 
normal psychology, it should be maximal. 
The assumption here is that amount of 
morbid subject matter varies among the 
different psychology courses. 


PROCEDURE 

The present experiment was conducted
in a small midwestern coeducational institution,
population 695. The Cornell Index,
Form N-2, was selected as the instrument
to measure emotional conflict or poor
adjustment. This instrument measures the
number of neuropsychiatric and psychosomatic
symptoms and is objectively scored.
It was decided to compare students in four
college courses: a communications course,
N = 46; a general psychology course, N =
56; a child psychology course, N = 33; and
an abnormal psychology course, N = 36.
These courses had approximately the same 
proportion of males and females. This was 
important owing to the fact that the norms 
for the Cornell Index show differences be- 
tween the sexes. For the most part, the 
communications course consisted of fresh- 
men, and the general psychology course 
consisted of sophomores. The abnormal and 
child psychology courses consisted mostly 
of sophomores, but did include some 
juniors and seniors. In addition, the general 
psychology course was a prerequisite for 
both child and abnormal psychology. How- 
ever, child psychology was not a pre- 
requisite for the abnormal psychology 
course. 

The Cornell Index was administered on 
the first class day of each course and the 
results compared. This was done to elimi- 
nate the possible biasing effects of being 
exposed to course subject matter. 


RESULTS AND DISCUSSION 
Analysis of variance yielded a statis- 
tically significant F ratio of 3.95 (p < .01), 
thus confirming the results of the Mills 
study (1955). 
In addition to the statistically significant
F ratio, however, the data, as shown in 
Table 1, lend support to the present 
author’s hypothesis stated previously. Ob- 
servation of the means shows the following 
trend. As the presumed amount of mor- 
bidity increased in courses, mean scores on 
the Cornell Index showed an increase. How- 
ever, only a small difference between the 
means of general and child psychology is 
evidenced. This might be an indication 
that there is relatively little difference in 
the amount of morbid or bizarre subject 
matter in the two courses. Based on this 
assumption, t tests were calculated in-
dividually for all comparisons.

The results as presented in Table 2 show
that the only statistically significant t was
for the communications vs. abnormal psy-
chology comparison. However, since the
difference between general and child psy-
chology was slight, these groups were com-
bined and t tests calculated. The t value for
the difference between means of the com-
bined groups and abnormal psychology was
statistically significant (p < .05). The t
value for the difference between means of the
combined groups and communications was
1.96. For significance at the .05 level, a t
value of 1.97 is required. Thus, the latter
probability was <.06 but >.05. This is
interpreted as evidence to support the con- 
tention that abnormal psychology students 
are more maladjusted than general and 
child psychology students. Furthermore, 
general and child psychology students are 
probably more maladjusted than communi- 
cations students. 
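
As an arithmetic check on Table 2, the degrees of freedom follow from the group sizes (N1 + N2 - 2), and a t value can be recomputed from the means and standard deviations in Table 1. The following is a minimal sketch that assumes an equal-variance (pooled) two-sample t test; the article does not state which formula was used, so recomputed t values may differ somewhat from the published ones. The function and variable names are illustrative only.

```python
import math

# Summary statistics from Table 1: (N, mean, SD) for each course.
COURSES = {
    "communications": (46, 7.54, 6.42),
    "general": (56, 9.79, 7.79),
    "child": (33, 10.09, 5.42),
    "abnormal": (36, 12.92, 7.37),
}


def pooled_t(group_a, group_b):
    """Return (df, t) for an equal-variance two-sample t test.

    Assumption: the SDs are sample SDs (n - 1 denominator); the
    article does not say how they were computed.
    """
    n1, m1, s1 = group_a
    n2, m2, s2 = group_b
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return df, (m2 - m1) / standard_error


df, t = pooled_t(COURSES["communications"], COURSES["general"])
print(df, round(t, 2))
```

The degrees of freedom computed this way for the communications vs. general psychology comparison (100) agree with Table 2.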


TABLE 1
MEANS AND STANDARD DEVIATIONS
OF COLLEGE COURSES

Course                    N      Mean     SD

Communications            46     7.54    6.42
General psychology        56     9.79    7.79
Child psychology          33    10.09    5.42
Abnormal psychology       36    12.92    7.37


TABLE 2
t VALUES FOR MEANS OF DIFFERENT
COURSES

Courses                                           df    t Values   p Values

Communications vs. General psychol.              100     1.55       >.05
Communications vs. Child psychol.                 77     1.84       >.05
Communications vs. Abnormal psychol.              80     4.75       <.01
General psychol. vs. Child psychol.               88      .20       >.05
General psychol. vs. Abnormal psychol.            90     1.93       >.05
Child psychol. vs. Abnormal psychol.              67     1.92       >.05
Gen. & Child (combined) vs. Communications       133     1.96       >.05
Gen. & Child (combined) vs. Abnormal psychol.    123     2.23       <.05


These findings support the author's hy-
pothesis and seem to indicate that psy-
chology, per se, is not the most important
selective factor in attracting poorly ad-
justed students. Instead, the findings show
that the study of bizarre or deviant be-
havior is the more likely culprit.

If it is permissible to generalize to other
similar abnormal psychology courses in
other institutions of higher learning, it
might be well to reflect on the best method
of teaching such a course. This has already
been touched upon elsewhere (Mills, 1955).
It is quite possible that students registering
for such a course are not only interested in
the subject matter as subject matter, or a
grade, as may well be the case in other
academic areas, but are also interested in
getting some help and/or insight into the
nature of their own personal problems or
the problems of others. The present author,
after interviewing students in an abnormal
psychology course, found this possibility
more of a reality suggesting that considera-
tion should be given to the possibility of
orienting the course, at least in part, in the
general direction of therapeutic-type
teaching situation. That is, a secondary ob- 
jective might be added to the course in 
addition to the usual academic objective. 

Autobiography, as suggested by Brower 
(1947), offers the possibility of aiding in 
this respect. A group therapy approach 
offers still another possibility. Certainly re- 
assurance might be consciously given from 
time to time. In addition, alertness might 
be maintained for “loaded” questions. Re- 
gardless of method utilized, and it would 
undoubtedly vary considerably depending 
on circumstances, consideration should be 
given to the fact that students in abnormal 
psychology apparently desire something 
more than is customary in other academic 
areas. Additional research is needed to shed 
still more light on this problem. 

The usual practice of requiring psy- 
chology majors to take abnormal psy- 
chology raises an additional question. Are 
psychology majors more poorly adjusted 
than nonpsychology majors? The present 
data do not permit a conclusion one way
or the other. Additional study is necessary
in order to shed light on this question. 


SUMMARY 


The Cornell Index was administered to 
four college courses varying in amount of 
morbid subject matter. They were: a non- 
psychology course, a general psychology 
course, a child psychology course, and an 
abnormal psychology course. A statistically 
significant F ratio was found (p < .01). In 
addition, a relationship between personal 
adjustment scores of students and mor- 
bidity of course subject matter was pointed 
out. It was concluded that psychology, per 
se, was not as important a selective factor 
as was specific course content.


THE RELATIONSHIP OF AUTHORITARIAN
PERSONALITY TO LEARNING:
F SCALE SCORES COMPARED TO CLASSROOM PERFORMANCE


ANN FILINGER NEEL 


Wyandotte County Guidance Clinic, Kansas City, Kansas 


The laboratories in which learning 
theories are put to test need not concern 
themselves with the personalities of their 
experimental subjects. The subject, rat, 
cat, pigeon, or guinea pig, usually will press 
bars, or peck keys, or pull levers or salivate 
with, as far as we know, no prejudice or 
compunction about the bar or lever he 
uses, or the bell or food pellet to which he 
responds. But human learning becomes 
much more complex because the kind of 
person the subject is influences his reaction 
to the material he is to learn, and the way 
in which he goes about learning it. It will 
also influence what he learns, or at least 
what changes in performance he will mani- 
fest. In fact, it is this differential reaction 
which specifies and reinforces the person- 
ality of the subject. 

The present paper explores one facet of 
the interaction between personality and 
the process by which exposure to new data 
alters behavior. While the appearance of 
such alterations will ordinarily be regarded 
as sufficient evidence for learning, one can 
not altogether rule out the possibility that 
it represents variations in performance due 
to other factors. This will be brought out 
later in discussing some of the findings. 

The aspect of personality selected for 
examination was the so-called authoritarian 
personality (Adorno, Frenkel-Brunswick, 
Levinson, & Sanford, 1949). In addition to 
the conflicts over dominance and submis- 
sion, there are certain other characteristics 
hypothesized to be present in such a person- 
ality type. Such a person is presumed to be 
unable to tolerate ambiguity, preferring to
deal with well defined and well ordered
material, and prone to imposing such order
upon the world of his own accord if it does
not exist in its own right. Once he has
organized his perceptions or beliefs he is
slow to change if he can do so at all. He
responds to people as to any other stimulus,
fitting them into moral categories rather
than appreciating individual variations.
He can not understand or empathize with
others, and more than likely can not even
like them.

1 Presented at the American Psychologi-
cal Association, Washington, D. C., Sep-
tember 1958.

How might such a personality affect the 
learning process? In the first place, such a 
person, with his resentment and antago- 
nisms for others, should find it difficult to 
learn to achieve certain types of knowledge 
about and understanding of human be- 
havior. A person who can not understand 
or tolerate others would find it hard to 
achieve the attitudes of humanitarian in- 
structors. Learning in this area would be 
complicated by the lack of definitive struc- 
ture and established fact. Aside from 
difficulties with the material itself, the 
authoritarian person would probably find 
it uncomfortable and perhaps distasteful 
to be exposed to this type of content. Simi- 
lar complications would not be expected 
with more factual learning. 

These considerations generate the follow- 
ing hypotheses: 

1. The more authoritarian a person is, 
the more likely he is to have difficulty 
learning material which (a) deals with 
humanitarian philosophy or (b) is ambigu-
ous. Such difficulty should not be evidenced
where learning of factual material is in- 
volved. 

2. The more authoritarian a person is, 
the more dislike he should manifest for
materials involving ambiguous or humani-
tarian materials.


SUBJECTS AND PROCEDURE 

Subjects. The persons employed as Ss were
30 male senior medical students taking a
required class in psychiatry. This class
included experience in evaluating and treat-
ing persons with emotional problems. The
emphasis was placed on the type of prob-
lems met with in general medical practice,
not on psychiatry per se. In addition to
patient contacts, there were numerous
seminars and discussions regarding therapy
and diagnosis.

Assessment of authoritarian personality. 
The group of students was given the F 
scale (Adorno, Frenkel-Brunswick, Levin- 
son, & Sanford, 1949) during one of their 
class periods. They were told this was a
study by psychologists investigating social
attitudes among medical students with
reference to the kind of speciality they
intended to enter. They were asked to put
their names on the attitude scales so that
their scores could be correlated with their
eventual choice of fields.

Assessment of learning. During the course, 
the students were given several examina- 
tions. One of these was considered to be a 
test of factual information, although it is 
not a pure culture test thereof since it in- 
volved the necessity of translating facts 
into behavior. The test is reproduced here: 


Your patient is a young married woman 
who is very hostile toward her husband but 
can not accept this. Describe the behavior 
and/or remarks by which she might demon- 
strate the following defenses against her 
conflict: Repression, Reaction Formation, 
Fantasy, Withdrawal, Displacement, Pro- 
jection, Intellectualization, Suppression. 


A maximum score of 10 points could be
earned on each item. The tests were
scored by three of the staff, each answer
being discussed by the group before a
score was assigned.

The other quiz employed here presented
fictitious case situations for the student to
handle. It demanded that the S deal with
a far more ambiguous area, where there are
no set rules. The first question of this test
was set up with special reference to the
hypothesis regarding humanitarian phi-
losophy. Much time had been spent during
the class stressing the lack of success in
treating and changing certain types of
people and the consequent necessity for
environmental manipulation and social
support.


1. You are in general practice in a me- 
dium sized town. A 37-year-old man comes 
to you as a private patient. He is married 
and has three small children. He presents a 
history of a minor back injury 12 years ago 
and has had vague complaints of back pain 
since that time. He states that for this rea- 
son he is unable to work and has been un- 
employed most of the time for four years. 
Job history prior to the injury was marginal, 
odd jobs which produced irregular minimal 
income. The family has been receiving gen- 
eral assistance from the County Welfare 
Department at those times the patient was 
unemployed. The patient has now applied 
for permanent and total disability assist- 
ance on the basis that he has been to many 
physicians and, he says, none have helped 
him. Your complete medical examination is 
negative. You estimate his intelligence to 
be slightly below normal. There is no evi- 
dence of a psychotic process. He stated the 
world was too much for him and he had no 
plans or concern for the future. What would 
your diagnostic impression be? What is your 
philosophy about the community’s respon- 
sibility for the man and his family? 

2. You are in general practice and a 24- 
year-old woman, married six months, comes 
to you with the complaint of frigidity. Your 
physical examination is negative. She asks 
for a good book on sex. How would you 
handle the situation and why? 

3. You are in general practice and you 
are treating a 55-year-old lady who has been 
successfully employed as a bookkeeper. 
Since a cerebral vascular accident about 
six months ago, she has been complaining 
of memory difficulty and inability to per- 
form her job as in the past. Psychological 
testing reveals moderate organic damage 
but sufficient ability to perform a less de-
manding job. She is not responding satis-
factorily under medical treatment and ex-
presses concern about herself and her job.
In view of this, you feel there are psychiat-
ric aspects to the problem. How would you
handle the situation and what would your
goals be?


For each question, the staff formulated 
points which the students should cover and 
then discussed each answer and assigned a 
score between 1 and 5. 

Assessment of student reaction to course.
During the final class period, the group was 
invited to evaluate the class and discuss 
their reactions to it. This was routine pro- 
cedure, and the students were well aware 
that such an opportunity was to be given. 
Most of the students were able to express 
themselves freely. Obviously some of them 
censored their remarks for fear of getting 
into trouble, but the fact that graduation 
was already assured served to relieve them 
to some degree. The discussion was recorded 
by the investigator² who was apparently
making notes for use in planning future 
classes. Those remarks were later coded 
for the following criteria: 


I. Structure-Ambiguity continuum: re-
marks concerning a desire for structure
through lecture, better organization, criti-
cism of class for vagueness, etc. were
scored +1; neutral remarks were scored 0;
and remarks concerning a desire for flexible
unstructured program were scored -1.

II. Attitude toward Psychiatric Patients:
remarks reflecting a dislike of patients, a
rejection of patients, or a feeling that pa-
tients did not like them were scored +1; neu-
tral remarks were scored 0; and like or don’t
mind patients scored -1.

III. Attitude toward Psychiatry: re-
marks of dislike, unnecessary, etc. scored
+1; neutral remarks scored 0; and remarks
which were propsychiatry, such as “want”
or “need for medical practice” were scored
-1.

IV. Evaluation of own ability to deal
with material: remarks of “weak” or “in-
adequate” scored +1; neutral remarks
scored 0; remarks of “adequate” scored -1.


All unscorable remarks were recorded as 
0. The code values were added algebraically 
to get a total score. 
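
The coding scheme can be sketched as follows. The sketch assumes that each remark is recorded as a (criterion, code) pair and that an unscorable remark is stored as None; the data layout, the names, and the example student are hypothetical and are not taken from the study's records.

```python
VALID_CODES = {-1, 0, 1}


def total_reaction_score(coded_remarks):
    """Add the code values algebraically; unscorable remarks count as 0."""
    total = 0
    for criterion, code in coded_remarks:
        if code is None:  # unscorable remark, recorded as 0
            code = 0
        if code not in VALID_CODES:
            raise ValueError(f"invalid code {code!r} for {criterion}")
        total += code
    return total


# Hypothetical remarks made by one student during the final class period.
remarks = [
    ("structure_ambiguity", 1),          # asked for more lectures
    ("attitude_toward_patients", 0),     # neutral remark
    ("attitude_toward_psychiatry", -1),  # "need for medical practice"
    ("own_ability", None),               # unscorable
    ("structure_ambiguity", 1),          # criticized class for vagueness
]
print(total_reaction_score(remarks))  # prints 1
```

A higher total therefore indicates more expressed discomfort with, or dislike of, the class.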


2 The investigator did not know the Ss’
F scale scores at the time of these ratings.

No effort was made to assess intelligence,
since it was believed that students who had 
survived college and four years of medical 
school should have sufficient and fairly 
uniform ability to learn. 


RESULTS AND DISCUSSION

Table 1 gives the chi square comparison
of the F scale score and the score obtained
on the test question regarding humani-
tarian philosophy. If the relation predicted
by Hypothesis 1 does exist, we would
expect those persons above the median on
authoritarianism to score below the median
of the question. Such is the case. Thus it
does appear that a person’s general feeling
about people, his social philosophy, deter-
mines or at least is related to whether or
not the person can or does learn and/or
use material relating to humanitarianism.
(It may be, of course, that there is no
difference in learning. The observed dif-
ferences could be the results of variations
in performance due to variables other than
learning. Just how one can bring out the
existence of learning which did not affect
behavior is, however, a question which
haunts the operational minded investiga-
tor.)

The comparison of F scale and the total
score on the ambiguous test is given in
Table 2. Again, the prediction is borne out.

The hypothesis holds that more factual
material should not be subject to this
differential effect. Table 3 gives the com-
parison of F scale and “factual” test. Here
the chi square is not significant, as the
hypothesis would predict.


TABLE 1
CHI SQUARE COMPARISON OF SCORE ON F
SCALE WITH SCORE ON QUESTION RE-
GARDING HUMANITARIAN PHILOSOPHY

                             F Scale (Median 71)
                         Md. or Below   Above Md.   Total

Humanitarian Question
  Md. or Below
  Above Md.
  Total

χ² = 6.45.
p = .005 (one-tailed test).


TABLE 2
CHI SQUARE COMPARISON OF SCORE ON F
SCALE WITH SCORE ON
“AMBIGUOUS” TEST

                             F Scale (Median 71)
                         Md. or Below   Above Md.   Total

Ambiguous Test
  Md. or Below                 4            8         12
  Above Md.                   14            4         18
  Total                       18           12         30

χ² = 6.68.
p = .005 (one-tailed test).


TABLE 3
CHI SQUARE COMPARISON OF SCORE ON THE
F SCALE WITH SCORE ON
“FACTUAL” TEST

                             F Scale (Median 71)
                         Md. or Below   Above Md.   Total

Factual Test
  Md. or Below                 7            8         15
  Above Md.                    6            2          8
  Total                       13           10         23a

χ² = .71 (corrected for continuity).
p = .40

a Seven Ss did not take the factual test. Of these
seven, four were below the F scale median, three above
it.

The final prediction was that the more 
authoritarian person should be more un- 
comfortable in the class and dislike the 
subject matter. The student’s reaction to 
the course was assessed by coding his 
evaluative remarks regarding the class 
experience. The prediction would call for 
persons above the F scale median to be 
above the remark median. Table 4 reveals 
that while the trends are in the predicted
direction, the results fail to reach an accept-
able level of significance. This may have
occurred because of a tendency for the 
students to soften their remarks or to say 
what was expected (this would be expected 
of authoritarian persons). Eight students 
made no remarks, resulting in an extremely 
small N. Also, the prediction might be 
dulled by the fact that many evaluative 
comments were quite accurate and realistic. 
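
The chi square reported for Table 4 can be reproduced from its four cell frequencies (10, 4, 3, 5). The sketch below assumes that "corrected for continuity" refers to Yates' correction for a 2 x 2 table, which the text does not name explicitly; the function name is illustrative.

```python
def yates_chi_square(a, b, c, d):
    """Chi square with Yates' continuity correction for the 2 x 2 table
    [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2, never letting the corrected value go below 0.
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * corrected ** 2 / denominator


# Table 4: rows are reaction scores (Md. or below, above Md.),
# columns are F scale scores (Md. or below, above Md.).
chi2 = yates_chi_square(10, 4, 3, 5)
print(round(chi2, 2))  # prints 1.22
```

With 22 students and marginal totals of 14, 8, 13, and 9, the corrected value agrees with the 1.22 given in the table.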

Actually it was the students’ remarks 
about the class which initiated the study. 
Bitter complaint had been made by the 
students to various staff members, and 
some had voiced extreme dislike of course 
and subject matter. The greatest area of 
unhappiness had been the seminar or dis- 
cussion-group nature of the class and the 
fact that nothing was ever “tied together” 
for them. This was brought up on a day 
when case presentation had been particu- 
larly organized and definitive. This, plus 
many other efforts to meet student criti- 
cism with lack of success, suggested that 
something must be interfering with the 
class other than the usual problems of 
technique, subject matter, etc. In view of 
these observations, and the suggestive 
trends in the data, it seems that the hypoth- 
esis deserves another, more sensitive test 
before being discarded. 


TABLE 4
CHI SQUARE COMPARISON OF SCORE ON
F SCALE WITH REACTIONS
TO CLASS

                             F Scale (Median 71)
                         Md. or Below   Above Md.   Total

Reactions
  Md. or Below                10            4         14
  Above Md.                    3            5          8
  Total                       13            9         22*

χ² = 1.22 (corrected for continuity).
p = .15 (one-tailed test).

* Eight students made no evaluation remarks. Of
these eight, five were below the F scale median, three
were above it.

Returning to the general problem of the
authoritarian personality on learning, one 
student’s remark regarding a case presen- 
tation illustrates the hypotheses of this 
study perhaps better than all the interpre- 
tive comments so far given. “I would like 
a classical example. The staff should make 
a diagnosis. If it turns out to be something 
else, turns out wrong, O.K., but it’s some- 
thing to remember. Then when I see a 
patient in practice I can say ‘I saw one 
like you!’ It is so confusing now.” 


SUMMARY 


It was hypothesized that the more 
authoritarian a person is, the more likely 
he would be to have difficulty (a) in learn- 
ing material which involved humanitarian 
philosophy and the need for understanding 
people, (b) in mastering ambiguous mate- 
rial which required him to think on his 
own, but (c) not in learning of factual sub- 
ject matter. The authoritarian person 
should be more uncomfortable with, and 
more likely to state dislike for, ambiguous,
humanitarian, and psychological subjects,
and/or a democratic teaching atmosphere. 
Thirty male seniors in a medical school 
taking a required course in psychiatry were 
given an F scale, the scores on this being 
compared by means of chi square with 
scores on a test question dealing with social 
philosophy regarding indigent persons, with
an examination dealing with handling case
situations (“ambiguous” material), and a
quiz over factual matter. The students’
evaluative remarks regarding the class were
used as an index of their discomfort and
dislike of the class. The hypotheses regard-
ing humanitarian, ambiguous, and factual
material were validated. The prediction
regarding dislike and discomfort ap- 
proached significance, but because of the 
small N, and reality factors, it failed to be 
satisfactorily demonstrated. Further test- 
ing would seem to be warranted. It may be 
concluded that the more authoritarian a 
person is, the more difficulty he will have 
in learning psychological material. 

On the basis of the present findings, it 
would seem reasonable to anticipate that 
authoritarian persons would experience 
difficulty in mastering material with hu- 
manitarian content. One would predict 
that this same difficulty would appear in 
other areas of social science, especially 
where theoretical rather than factual 
materials are utilized. Research is now
being conducted to test this prediction.


DIAGNOSTIC RATING OF TEACHER PERFORMANCE¹


DON J. COSGROVE 


Procter & Gamble Company, Cincinnati, Ohio 


The present article outlines a new 
method for evaluating the effectiveness of 
the teaching performance of secondary 
school and college instructors. The method 
is a modification of the forced-choice tech- 
nique, and the paper shows how it might 
be used to produce diagnostic profiles val- 
uable for alerting a teacher as to what his 
students consider to be the relatively 
strong and weak aspects of his perform- 
ance. 

Diagnostic application of rating prin-
ciples was begun by grouping phrases de-
scriptive of teacher performance into cate-
gories according to similarity of behavioral
reference, the grouping later being verified
by statistical means. Use of these grouped
phrases in a rating scale allows for totaling
category scores which in turn may be pre-
sented in “diagnostic” or “descriptive”
profile form. The approach stems from
original work of Wherry (1950) at Ohio
State University.


DEVELOPMENT OF THE RATING FORM


Initial Selection of Phrases 


The first step in the development of the 
teacher rating form was the selection of a 
pool of phrases which described specific 
elements of teacher performance. A list of 
900 descriptive phrases is given in the re- 
port of a study of the control of bias in 
rating carried out under the direction of 
Wherry (1950) at Ohio State under the 
sponsorship of the Personnel Research Sec-
tion of The Adjutant General’s Office. One
hundred ninety-six phrases were selected 
from this list because of their pertinence
to the general teaching methods used in
the educational psychology program in
which the present study took place. Four
additional phrases were composed to bring
the total to 200. All of the selected phrases
were short descriptions of teacher behavior,
and all were of a positive or favorable na-
ture. Those taken from the AGO report
were originally gathered from paragraphs
written by educational psychology students
who were asked to write short essays de-
scribing “good” teachers. Each of the 200
phrases was written on a plain white 3 x 5
index card, one phrase to a card.

1 This article is taken from a doctoral
dissertation completed at Ohio State Uni-
versity. The writer acknowledges the valu-
able assistance of Robert J. Wherry on this
project and thanks him for it.

Five instructors in the educational psy- 
chology program, plus an additional gradu- 
ate student in psychology, independently 
sorted the phrases into five specified sub- 
areas of general teacher performance. The 
subareas were preselected from a larger 
group suggested in the AGO study men- 
tioned earlier and bore the following titles: 


1. Mastery and Organization of Subject
Matter

2. Skill in the Control and Discipline of
Students

3. Reasonableness of Demands on Stu-
dent Time and Effort in View of Help
and Directions Given

4. Efficiency and Logicality of Classroom
Management Procedure

5. Skill in Motivating, Inspiring, and
Creating Confidence in Students


The 150 phrases on which there was most 
agreement by sorters were retained. 


The Descriptive Check List 


The 150 descriptive phrases were next 
presented in the form of a descriptive check 
list to 100 educational psychology students 
at Ohio State University. The students 
were divided into three groups. Each
member of the first group was asked to call
to mind some previous instructor to whom
he could give a particular rank between 1
and 10 on an over-all scale of 1 (worst) to
20 (best), among a typical group of 20 in-
structors. Students in the second group
were asked to choose instructors who could
be ranked between 11 and 20 on the same
scale, and members of the third group were
asked to think of instructors to whom they
could assign ranks between 9 and 16 on
such a scale. The grouping procedure was
followed to increase the probability of a
normal distribution of ranks being assigned
to the 100 instructors.

Each student was then asked to consider 
the 150 phrases, one at a time, as they 
applied to the instructor he had in mind 
and to assign a value to each phrase indi- 
cating the degree to which he thought the 
phrase described the teacher. A value of 5 
indicated great applicability, and a value 
of 1, little applicability, while values 2, 3, 
and 4 represented intermediate points along 
the scale. Thus, if a student thought the 
phrase, “Contagious enthusiasm for sub- 
ject,” applied very well to the instructor, 
he blackened space No. 5 for that phrase 
on an IBM answer sheet. If a student 
thought this phrase had little applicability 
to the instructor, he filled in space No. 1. 

The descriptive check list responses fur-
nished all the raw data required for com-
puting scale values for the phrases and
selecting those to be used on the final rat-
ing form.


Preference Indices 


A preference index was established for 
each phrase by computing the mean appli- 
cability value given the phrase on the 100 
check lists. This index expressed the degree 
of willingness of the students to apply a 
phrase to the behavior of the instructors 
being considered. Preference indices pro-
vided the first basis for assigning phrases
to sets in the final rating form.


Discrimination Indices 


A second index, the discrimination index, 
was computed for each phrase by corre-
lating applicability values assigned the
phrase and the ranks (from 1 to 20) given 
to the instructors. The discrimination index 
provided a measure of the degree to which 
the behavior described by a phrase was 
related to successful teaching. 
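
The two indices can be illustrated with a short sketch. It assumes that the discrimination index is a Pearson product-moment correlation, which the text does not name; the function names and the five-check-list data are hypothetical.

```python
import math


def preference_index(values):
    """Mean applicability value (1 to 5) given to a phrase."""
    return sum(values) / len(values)


def discrimination_index(values, ranks):
    """Pearson correlation between the applicability values given to a
    phrase and the over-all ranks (1 to 20) of the instructors rated."""
    n = len(values)
    mean_v = sum(values) / n
    mean_r = sum(ranks) / n
    cov = sum((v - mean_v) * (r - mean_r) for v, r in zip(values, ranks))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_r = sum((r - mean_r) ** 2 for r in ranks)
    return cov / math.sqrt(var_v * var_r)


# Hypothetical responses to one phrase on five check lists.
applicability = [1, 2, 3, 4, 5]
instructor_rank = [2, 6, 10, 14, 18]  # 1 = worst, 20 = best

print(preference_index(applicability))  # prints 3.0
print(discrimination_index(applicability, instructor_rank))
```

In this contrived case the phrase is applied more strongly the higher the instructor's rank, so the discrimination index is at its maximum.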


Factor Analysis of the Descriptive Phrases 


Since all the phrases included on the 
descriptive check list had previously been
sorted into five categories or subareas of 
teacher activity, it was possible to add the 
applicability values given to all phrases 
within each category to obtain category 
scores. That is, each of the answer sheets 
produced a score for each subarea, giving 
a total of five scores. 

Beginning with the correlations between 
the scores for the five subareas and the 
correlation of each of the phrases with each 
of the subareas, a factor analysis of the 
items was achieved. The general procedure 
was a modification of a method for factor- 
ing large numbers of items reported by 
Wherry and Winer (1953). This method 
was used because it avoided the necessity 
of correlating each item with every other 
which would result in a matrix of unman-
ageable size.

Analysis of the items resulted in the ex-
traction of a general factor of over-all
teacher effectiveness, two subgenerals and
four specific factors. The use to which
these results were put will be shown in the
next section.


Final Rating Form 


All the information necessary for the
construction of the final rating form was
now available. The plan was to group
phrases into 10 sets of four phrases each.
Each phrase within a set was to have a
significant loading on a different one of
the four specific factors. The task of the
rater was then to rank the phrases within
each set as they described the teacher being
considered.

Grouping phrases into sets was done
first in terms of preference indices. The
aim was to have phrases in each set which
a student would be equally willing to use 
in describing a teacher. That is, the four 
phrases should have similar preference in- 
dices. 

It was also desirable for the phrases to 
have similar discrimination indices. This 
provided assurance that all phrases in a set 
were about equally related to general teach- 
ing effectiveness. It was hoped, too, that 
phrases within a set would, as nearly as 
possible, have equal loadings on the general 
factor of teaching effectiveness. Of course, 
each phrase had to have a significant load- 
ing on a different one of the four specific 
factors and negligible loadings on the 
others. Ten sets of phrases were thus 
chosen. This meant that 40 of the pool of 
150 phrases were used. 

The titles given to the four specific fac- 
tors of teacher performance as suggested 
by the phrases chosen to represent them 
were: 


1. Knowledge and Organization of Sub- 
ject Matter 

2. Adequacy of Relations with Students 
in Class 

3. Adequacy of Plans and Procedures in 
Class 

4. Enthusiasm in Working with Students 


The form is called “The Descriptive 
Ranking Form for Teachers” and is shown 
here in its entirety. A number follows each 
phrase to indicate the subarea to which 
the phrase belongs. These numbers do not 
appear on the form when it is used. 


This form consists of 10 sets of phrases 
which are descriptive of instructor perform- 
ance. Each set is composed of four phrases. 

Please consider the instructor you have 
for this course. In each set of phrases, rank 
the phrases from 1 to 4 as they apply to your 
instructor. Give a rank of 1 to the phrase 
which most applies, and 4 to the phrase 
which least applies, using Ranks 2 and 3 for 
the intermediate ranks. Every phrase must 
be ranked. There can be no equal ranks. 

Set a

— Always on time for class [3]

— Pleasant in class [2]

— Very sincere when talking with students
[4]

— Well-read [1]

Set b 

— Contagious enthusiasm for subject [4] 

— Did not fill up time with trivial material 
[3]

— Gave everyone an equal chance [2] 

— Made clear what was expected of stu- 
dents [1] 

Set c 

— Classes always orderly [3] 

— Enjoyed teaching class [4] 

— Friendliness did not seem forced [2] 

— Logical in thinking [1] 

Set d 

— Encouraged creativeness [4] 

— Kept course material up to the minute
[1]

— Never deliberately forced own decisions 
on class [2] 

— Procedures well thought out [3] 

Set e 

— Authority on own subject [1] 

— Friendly attitude toward students [4] 

— Marked tests very fairly [3] 

— Never criticized in a destructive way [2] 

Set f 

— Good sense of humor [4] 

— Spaced assignments evenly [3] 

— Students never afraid to ask questions 
in class [2] 

— Well organized course [1] 

Set g 

— Accepted students’ viewpoints with open 
mind [2] 

— Increased students’ vocabulary by own 
excellent usage [1] 

— Students always knew what was coming 
up next day [3] 

— Students willingly worked for teacher [4] 

Set h

— Always knew what he was doing [3] 

— Appreciated accomplishment [4] 

— Did not ridicule wrong answers [2]

— Well informed in all related fields [1] 

Set i

— Always had class material ready [3] 

— Covered subject well [1] 

— Encouraged students to think out an- 
swers [4] 

— Rules and regulations fair [2] 

Set j 

— Always managed to get things done on 
time [3] 

— Course had continuity [1] 

— Made material significant [4] 

— Understood problems of students [2]

TABLE 1

Subarea score scale, Low to High: 10 15 20 25 30 35 40

1. Knowledge and Organization of Subject Matter
2. Adequacy of Relations with Students in Class
3. Adequacy of Plans and Procedures in Class
4. Enthusiasm in Working with Students


The form is scored by adding across sets 
all the ranks given to the phrases of each 
subarea. In scoring, the rank values are 
reversed. A phrase ranked 1 is scored 4, a 
phrase ranked 2 is scored 3, and so on. 
For diagnostic purposes these subarea 
scores may be put into profile form, clearly 
showing comparative subarea ratings given 
the teacher. Such a profile is shown in
Table 1.
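The scoring rule just described can be sketched in code. This is only an illustration, not part of the original form: the `KEY` table transcribes the bracketed subarea numbers printed after each phrase above, while the names `KEY`, `score_form`, and the example rater are hypothetical.

```python
# Subarea key transcribed from the form: for each set, the subarea (1-4)
# of each phrase in its printed order.
KEY = {
    "a": [3, 2, 4, 1],
    "b": [4, 3, 2, 1],
    "c": [3, 4, 2, 1],
    "d": [4, 1, 2, 3],
    "e": [1, 4, 3, 2],
    "f": [4, 3, 2, 1],
    "g": [2, 1, 3, 4],
    "h": [3, 4, 2, 1],
    "i": [3, 1, 4, 2],
    "j": [3, 1, 4, 2],
}


def score_form(ranks):
    """Return subarea totals for one completed form.

    `ranks` maps each set letter to the four ranks (1-4) the student gave,
    listed in the printed order of the phrases.
    """
    totals = {1: 0, 2: 0, 3: 0, 4: 0}
    for set_id, subareas in KEY.items():
        set_ranks = ranks[set_id]
        if sorted(set_ranks) != [1, 2, 3, 4]:
            raise ValueError(f"Set {set_id}: every phrase needs a distinct rank 1-4")
        for subarea, rank in zip(subareas, set_ranks):
            totals[subarea] += 5 - rank  # reversed scoring: rank 1 -> 4 points
    return totals


# Hypothetical rater who always ranks the subarea 3 phrase first, followed by
# the phrases of subareas 1, 2, and 4 in that order.
preference = {3: 1, 1: 2, 2: 3, 4: 4}
example_ranks = {s: [preference[a] for a in areas] for s, areas in KEY.items()}
example_totals = score_form(example_ranks)
```

Because every set distributes the same 4 + 3 + 2 + 1 = 10 points, each subarea total falls between 10 and 40, and the four totals always sum to 100.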

Higher subarea scores mean higher rank- 
ing of the phrases of the subarea. On the 
profile in Table 1, phrases of Area 3, Ade- 
quacy of Plans and Procedures in Class, 
were ranked highest and phrases of Area 4, 
Enthusiasm in Working with Students, 
were ranked lowest. 

An individual’s standings on the sub- 
areas are relative to one another and not 
to any external standard. The profile does 
not indicate how a teacher compares with 
other teachers in general effectiveness. 
What it does show is the relatively strong 
and weak areas of his own performance, so 
that he will know where he might best 
start to increase his teaching skill. 

The form was not validated against any 
sort of external criteria of performance in 
the four areas. In regard to validity, how- 
ever, it should be pointed out that the 
content of the individual phrases in the 
form is relevant to teacher effectiveness. 
The discrimination indices of the phrases
were acceptably high, the median being .46,
with Q1 at .39 and Q3 at .53. As was stated,
the discrimination index indicated the degree
of relationship between the behavior
described by a phrase and teacher effectiveness
as seen by students.

The Descriptive Ranking Form was used 
by students to rate their instructors in a 
basic educational psychology course at 
Ohio State University. Ratings took place 
in 12 separate class sections of the course, 
with 8 to 12 students in each section using 
the form. The forms completed in each 
section were randomly divided into two 
equal groups. By averaging individual area 
scores resulting from the forms used by 
each of these small student groups, two 
profiles resulted for each class section. 
Cattell’s (1949) shape correlation coefficient 
was used to express the similarity of the 
shape of the two profiles of each of the 12 
pairs. The extent of this similarity provided 
a rough indication of the amount of agree- 
ment which existed between two groups of 
students using the form to rate the same 
teacher. Pooling the 12 shape correlation 
coefficients thus computed resulted in a 
coefficient of .74. 

The advantage of the profile approach 
is that a teacher (and his supervisor) can 
consider his behavior in terms of really 
pertinent dimensions. The typical analysis 
of ratings involves a consideration of in- 
dividual item responses, and often results 
in a hodge-podge conception of the com- 
petence of a teacher. And the longer the 
rating form the more difficult it is to gen- 
eralize from it. The profile approach, on 
the other hand, presents a more organized
and useful analysis of the rating, giving
the teacher a clearer idea of the relatively 
strong and weak areas of his performance. 
The forced-choice format also minimizes 
bias, which so often is evidenced in typical 
graphic rating schemes. 


SUMMARY 


A method has been presented for evaluating
the relative effectiveness of a teacher’s
performance in four areas of activity.
Phrases descriptive of teacher behavior 
were grouped in sets of four, each phrase 
representing a different one of the areas. 
In completing the form, students rank the 
phrases of each set as they apply to the 
instructor. The end result is a profile showing
the relative standing of the instructor
on the four areas of teacher activity. This 
evaluation system will be useful to the in- 
structor who wants to know where to begin 
work on improving his effectiveness. 
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LEARNING WITHOUT AWARENESS AND TRANSFER 
OF LEARNING SETS¹


JULIUS M. SASSENRATH 
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Among the many factors involved in a 
complex learning process, educational and 
experimental psychologists have discussed 
the importance of learning sets or learning 
how to learn (Buswell, 1956; Harlow, 1949). 
One of the questions confronting the study 
of human learning sets is whether or not 
learning sets are at times acquired without
the subject’s (S’s) awareness of what he is
learning.

Although not overlooking the importance 
of the learner’s drives in modulating the 
effectiveness of rewards, Thorndike (1935, 
p. 40) hypothesized that the consequences 
of S-R connections could improve perform- 
ance without S understanding what he was 
learning. Attempts to explore this hypothe- 
sis gave rise to the problem of learning 
without awareness (hereafter LWA). Today,
many experimenters (Es) operationally
define awareness as a correct verbalization
by S of what he is learning (Adams,
1957; Postman, 1955). Any reliable increase
in performance prior to awareness is considered
to be LWA. Since S’s performance
can increase as a function of partially correct
verbalizations for his responses, some
studies of LWA have been questioned by
Adams (1957).

1 This article is adapted from a Ph.D.
dissertation entitled: Transfer of learning
sets employing “learning without awareness”
procedures, submitted to the University
of California, Berkeley. The author
would like to thank Guy T. Buswell, Harold
D. Carter, Rheem F. Jarrett, Jack A.
Holmes, and Bert Y. Kersh for helpful suggestions
and criticisms. Some of the material
was presented in a paper at a joint meeting
of the American and California Educational
Research Associations in San Francisco,
March, 1958. The data were reanalyzed during
the author’s tenure as a Public Health
Service Research Fellow of the National
Institute of Mental Health.

2 Now at Indiana University.

Learning set has been defined as a learn- 
ing how to learn a general type of problem 
(Harlow, 1949). Harlow, in his work with 
monkeys and preschool children, not only 
found evidence for a learning set on one 
type of problem but, also, transfer of 
learning set to a series of reversal problems. 
These data prompted the suggestion that 
“the learning set delivers the animal from 
Thorndikian bondage.” Indeed, studies on 
LWA, which employed a systematic prin- 
ciple for administering reinforcement in a 
Thorndikian situation (Hirsch, 1957; Irwin, 
Kaufman, Prior & Weaver, 1934; Philbrick 
& Postman, 1955; Postman & Jarrett, 
1952; Thorndike & Rock, 1934), also have 
found that adult human Ss could be de- 
livered from “Thorndikian bondage” by 
developing a learning set to infer a prin- 
ciple. However, no study has investigated 
the question of transfer of learning sets 
employing systematic Thorndikian proce- 
dures which would permit the assessment 
of LWA in learning sets. 

The purpose of the experiment to be re- 
ported here is (a) to investigate the effects 
of two procedures for developing a learning 
set to infer a principle in training and the 
subsequent effect of the training on the 
development of a learning set to infer a 
reversal principle during the transfer 
period; (b) to compare the LWA process 
and final criterion performance of the two 
training conditions; (c) to determine
whether or not the two experimental groups
show less LWA than the two control
groups during the transfer period; and (d)
to indicate whether or not LWA can be
attributed solely to S’s partially correct
verbalizations for his performance.

Task. Each S was presented with a series
of individual stimulus words from 2 to 10 
letters in length, and was required to re- 
spond to each word with a number from 1 
through 9. The E administered continuous
reinforcement. For the two experimental
groups in the training period, S’s response
was called Right if the number was equal
to the number of letters in the stimulus
word minus one, e.g., well = 3, general =
6, etc. Any other response was called
Wrong. Thus the principle upon which
reinforcement was administered, and which
the Ss could learn, was based upon a direct
relationship between the length of the
stimulus word and the correct response,
i.e., as words become longer the numbers
become larger.

In the transfer situation an entirely different
series of stimulus words was used.
However, the stimulus class remained
words from 2 to 10 letters in length and
the response class numbers from 1 through
9. An S’s response was now called Right
if the number was equal to 11 minus the 
number of letters in the stimulus word, e.g., 
well = 7, general = 4, etc. Any other re- 
sponse was called Wrong. Thus the principle 
upon which reinforcement was now admin- 
istered, and which S could learn, was the 
“reverse” of the training principle, i.e., as 
words become longer the number becomes 
smaller. This will be called the reversal 
principle. 
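The two reinforcement principles can be written out as a short sketch. The function names are hypothetical and serve only to illustrate the rules stated above; nothing here reproduces the experimenter's actual materials.

```python
def training_correct(word):
    """Number called Right under the training principle (letters minus one)."""
    return len(word) - 1


def reversal_correct(word):
    """Number called Right under the reversal principle (11 minus letters)."""
    return 11 - len(word)


def feedback(word, response, correct_for=training_correct):
    """Return 'Right' or 'Wrong' for a single trial (continuous reinforcement)."""
    return "Right" if response == correct_for(word) else "Wrong"
```

For any word of 2 to 10 letters both rules yield a number from 1 through 9, and the two correct responses for a given word always sum to 10.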

Materials and procedure. The stimulus 
material consisted of 360 common English 
words, each typed in capital letters on a 4 
by 6 white index card. Stimulus cards were 
presented at intervals of approximately 5 
sec. After every block of nine trials, Ss in
the two training groups and the four transfer
groups were asked, “Upon what basis
were you responding?” Each S’s response
to this question was recorded by E along 
with S’s number response. It was intended 
that this question serve two purposes: (a) 
to elicit a different class of responses (from
the number responses) to aid in an interpretation
of the learning process and (b)
to direct attention to a basis or principle 
for responding. When Ss in the training 
and transfer periods reached a criterion 
where they could correctly verbalize the 
principle followed by errorless performance 
for one block of trials, the training and 
transfer periods were terminated. 

In the training period, for the two experi- 
mental groups, there was a maximum of 
162 stimulus words arranged in 18 blocks 
of 9 words each. One of the experimental 
groups was presented with a homogeneous 
sequence in which each block of trials con- 
tained different words with the same num- 
ber of letters (hereafter, Exp. Hom.). For 
this group, the order in which blocks of 
words of different lengths appeared was 
determined randomly. A second experimental
group was presented with a heterogeneous
sequence in which each block of
trials contained different words with one
word of each length randomized within each
block (hereafter, Exp. Het.).
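The difference between the two presentation sequences can be illustrated with a small sketch. It works with word lengths only, since the word list itself is not reproduced here, and it assumes that each length occurs equally often across the homogeneous blocks, which the text does not state. The function names are hypothetical.

```python
import random

LENGTHS = list(range(2, 11))  # word lengths of 2 through 10 letters


def heterogeneous_blocks(n_blocks, rng):
    """Each block holds one word of every length, in random order."""
    blocks = []
    for _ in range(n_blocks):
        block = LENGTHS[:]
        rng.shuffle(block)
        blocks.append(block)
    return blocks


def homogeneous_blocks(n_blocks, rng):
    """Each block holds nine words of one length; block order is random."""
    # Assumption: lengths are cycled so that each appears equally often.
    lengths = [LENGTHS[i % len(LENGTHS)] for i in range(n_blocks)]
    rng.shuffle(lengths)
    return [[length] * 9 for length in lengths]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
het = heterogeneous_blocks(18, rng)
hom = homogeneous_blocks(18, rng)
```

With 18 blocks of nine trials, either sequence gives the maximum of 162 training words.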

The two control groups had only a 
“warm-up” during the “training” period, 
which consisted of guessing numbers in 
response to each of 45 stimulus words, with
no acknowledgement of Right and Wrong
by E. This procedure was not only employed
as a control for warm-up effects
but, also, to establish an empirical chance 
level of performance. One group received 
a homogeneous sequence (hereafter, Cont. 
Hom.) and the other a heterogeneous se- 
quence (hereafter, Cont. Het.) of stimulus 
words. During neither of the two warm-up 
procedures were Ss asked, “Upon what 
basis were you responding?” 

The transfer period immediately followed 
the training and warm-up periods. There 
was a maximum of 198 stimulus words 
arranged in 22 blocks of nine words each. 
All four groups (two experimental, two 
control) during the transfer period were 
presented with a heterogeneous sequence. 

Instructions. Before beginning the training
procedure, Ss in the two experimental
and two control groups were read the following
instructions:


I have prepared a list of words and have 
paired a number with each word. I have 
used numbers from 1 through 9. I have the 
numbers written on a paper on which I 
shall record your responses. It is a long list 
and I shall go through it only once. Try to 
give your responses as quickly as you can. 
Are there any questions?


In addition, the two experimental groups 
were told: 

I am interested in the number-response 
which you learn to give to each word, as a 
function of my saying Right or Wrong after 
each response. 


It was intended that this latter instruc- 
tion convey to S an “intentional set” to 
learn. Previous studies using similar proce- 
dures have employed an “incidental set” 
to learn (Hirsch, 1957; Philbrick & Postman,
1955). Before the training began, any
questions pertaining to instructional procedures
were answered.

The following instructions were read to 
each S before beginning the transfer proce- 
dure: 

Now, I'll show you a different set of cards 
or a different set of words. Again, however, 
I shall present you with a series of cards on 
which you will see one word. Respond to 
each word with a number from 1 through 9. 
I am interested in the number-response 
which you learn to give to each word, as a 
function of my saying Right and Wrong 
after each response. 


Design. There were four groups of 20 Ss 
each: two experimental and two control 
groups. The transfer situation consisted of 
a 2 by 2 factorial design in which the fac- 
tors are the presentation treatments (ho- 
mogeneous or heterogeneous) and the 
training treatments (experimental or con- 
trol) which the four groups received during 
the training and warm-up periods. 

Subjects. The Ss were 80 students from 
undergraduate classes in education at the 
University of California, Berkeley, who 
were assigned by means of a table of random
numbers to one of the four groups,
with a final correction for equal Ns. Each
S volunteered to participate in the experi- 
ment: by such a selection, the sample may 
have been restricted with regard to moti- 
vation and ultimate performance. 

RESULTS 

The data for the training and transfer 
periods have been analyzed with reference 
to: (a) the over-all performance in terms 
of the number of blocks of trials to criterion 
and (b) an analysis of the learning process
as reflected in learning curves. 

Trials to criterion. All of the Ss in the 
two experimental groups during the train- 
ing period were able to correctly verbalize 
the principle. However, the rapidity with 
which Ss correctly verbalized and reached 
the learning criterion differed for the two 
groups. The mean numbers of blocks of
trials to criterion in training were 6.20 and
9.10 for the Exp. Hom. and Exp. Het.
groups, respectively. The mean difference
was significant beyond the .02 level (t =
2.62, df = 38). Thus, as was expected, the
group presented with a series of different 
words of the same length in each block 
(Exp. Hom.) learned how to learn the 
principle more rapidly than the group pre- 
sented with a series of different words of 
varying length in each block (Exp. Het.). 

As was found during the training period, 
all Ss in the four groups during the transfer 
period were able to correctly verbalize the 
principle. The mean number of blocks of 
trials to criterion in the transfer period 
were 10.95, 10.90, 17.00, and 17.20 for the 
Exp. Hom., Exp. Het., Cont. Hom., and 
Cont. Het. groups, respectively. An analy- 
sis of variance was conducted to test the 
statistical significance among the four 
groups. The main effect of the training 
(versus control) treatments during training 
on the transfer situation was highly signifi- 
cant (F = 54.37, df = 1 and 76, p < .01). 
Therefore, learning to learn a principle in
training facilitated learning to learn the
reversal principle during the transfer period.
Thus Harlow’s (1949) findings with
nonverbal primates have been somewhat
verified with highly verbal primates,
namely, university students. The presentation
treatments in training, however, did
not contribute a source of variance on the 
number of blocks of trials to criterion in 
transfer (F = .00). Furthermore, the interaction
(training × presentation) can be
accounted for in terms of random sampling
(F = .02).

Analysis of learning curves. With nine 
stimulus words per block and nine possible 
numbers to choose from, the number of 
correct responses expected by chance is 1.0. 
As Philbrick and Postman (1955) have re- 
ported, the probability is less than .01 that 
S could obtain four correct responses per 
block of nine trials by chance alone. In an 
effort to establish an empirical chance level, 
Ss in each of the two control groups were 
given five blocks of nonreinforced trials. 
The mean performance curves for the two
control groups fell at the chance level of
expectation.
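The theoretical chance level can be checked with a short binomial sketch. It rests on an idealized assumption introduced here, namely that S guesses each of the nine numbers independently and with equal probability on every trial; the figure cited from Philbrick and Postman may rest on somewhat different assumptions. The function name is hypothetical.

```python
from math import comb


def prob_at_least(k, n=9, p=1 / 9):
    """Binomial probability of k or more correct guesses in n trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


expected_correct = 9 * (1 / 9)      # mean number correct per block by chance
p_four_or_more = prob_at_least(4)   # chance of four or more correct in a block
```

Under this model the expected number correct per block is 1.0, and the probability of four or more correct responses in a block is close to .01.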

[Fig. 1. Mean number of correct responses, Exp. Hom. and Exp. Het. groups.]

Figure 1 presents the mean number of
correct responses prior to correct verbalization
of the principle for the two experimental
groups during the training period. Since
both groups had errorless performance fol- 
lowing correct verbalization of the prin- 
ciple, the two blocks of trials on which Ss 
correctly verbalized the principle and then 
reached criterion are omitted in Figure 1. 
Inasmuch as Ss verbalized the principle 
after different numbers of blocks of trials, 
the data were made comparable by the use 
of Vincent curves (Hilgard, 1938). A log 
transformation of the data was performed 
in order to reduce heterogeneity of vari- 
ance. 
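One common way of constructing Vincent curves is to divide each S's sequence of blocks into the same number of equal fractions and to average performance within each fraction. The sketch below assumes four fractions (the Vincent quarters referred to in the analysis) and a proportional-overlap rule for blocks that straddle a boundary. The article does not say which variant was used, and `vincentize` is a hypothetical name.

```python
def vincentize(scores, parts=4):
    """Average a variable-length series of block scores into equal fractions.

    Each block occupies one unit on the trial axis. A block that straddles a
    boundary between fractions contributes in proportion to its overlap.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one block score is required")
    width = n / parts
    result = []
    for q in range(parts):
        lo, hi = q * width, (q + 1) * width
        total = 0.0
        for i, score in enumerate(scores):
            overlap = min(hi, i + 1) - max(lo, i)
            if overlap > 0:
                total += score * overlap
        result.append(total / width)
    return result
```

An S with eight blocks contributes the mean of each successive pair of blocks; an S with six blocks contributes weighted means over one and a half blocks per quarter, so that curves of different lengths can be averaged point by point.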

Grant’s (1956) procedure for the statis- 
tical analysis and comparison of curves was 
applied to the scores comprising the curves 
seen in Fig. 1. This statistical analysis 
indicates the over-all trend or slope of the 
curves is significantly different from zero 
(F = 23.36, df = 3 and 114, p < .001). 
Thus, there is substantial learning prior to 
correct verbalization of the principle. How- 
ever, when analyzing for the type of slope 
of the over-all trend, only its linear com- 
ponent was significant (F = 36.41, df = 1 
and 38, p < .001). Therefore, the rate of
learning to learn over successively different
S-R associations appears to show a
constant increase prior to correct verbalization
of the principle. The between-group
means is not significant (F = .05), and
indicates that the average number of cor- 
rect responses over the Vincent quarters is 
very similar for the two groups. However, 
since the statistical analysis indicates that 
the between-group trends is significant 
(F = 4.02, df = 3 and 114, p < .01), it is 
apparent that the slopes for the two learn- 
ing curves are different with group Exp. 
Hom. having the steeper slope. When analyzing
for the type of slope of the between-group
trends, only its linear component is
significant (F = 5.37, df = 1 and 38, p <
.05). Thus, although the two groups learned
at different rates, the rate of improvement
was constant within each group. Finally, a
highly significant difference in the between-individual
means (F = 9.76, df =
38 and 114, p < .001) indicates that reliable
measures of individual differences were obtained.

A more critical analysis of the perform- 
ance of group Exp. Hom. appears war- 
ranted since neither Philbrick and Postman 
(1955) nor Hirsch (1957) found such high 
levels of performance prior to correct verbalization.
It will now be shown that the
level of performance prior to correct statement
of the principle is not based solely
upon partially correct verbalizations of
the principle. In order to facilitate this
analysis, Hirsch’s (1957) procedure was
employed in which a partially correct verbalization
was defined as a statement by
S indicating that the magnitude of his
number responses was related to the number
of letters in the stimulus words. Any
other basis for responding was judged to 
be an incorrect statement of the principle. 
If improvement prior to correct verbaliza- 
tion can be attributed to partially correct 
verbalizations, then the largest number of 
such verbalizations should be offered on 
that block immediately prior to correct 
verbalization. Of the 20 Ss in group Exp.
Hom. in training, 15 offered incorrect verbalizations
and 5 offered partially correct
verbalizations on the block of trials immediately
prior to correct verbalization of
the principle. Yet, the mean difference between
the two subgroups in the number of
correct responses on the block of trials
immediately prior to correct verbalization
is not significant (t = .16). Thus, the high
level of performance by Ss offering incor- 
rect verbalizations of the principle in 
Group Exp. Hom. tends to invalidate the 
argument that partially correct verbaliza- 
tions can be invoked a priori to define away 
the phenomenon of LWA. 

Figure 2 presents the mean number of 
correct responses prior to correct verbaliza- 
tion of the reversal principle for the two 
experimental and two control groups during 
the transfer period. Again, since all four 
groups had errorless performance following 
correct verbalization of the reversal prin- 
ciple, the two blocks of trials on which S 
correctly verbalized the principle and then 
reached criterion are omitted in Fig. 2. 
Following the construction of Vincent 
curves (Hilgard, 1938), a log transforma- 
tion of the data was performed in order to 
reduce heterogeneity of variance. 

Fig. 2. Performance Prior to Correct Verbalization of the Principle During
the Transfer Period.

Grant’s (1956) procedure for the statistical
analysis and comparison of curves was
applied to the scores comprising the curves
seen in Fig. 2. The over-all trend or slope
of the curves is significantly different from
zero (F = 114.16, df = 3 and 228, p <
.001). Thus, there is substantial learning
prior to correct verbalization of the reversal
principle. Furthermore, when analyzing
the over-all trend for the type of slopes
represented in these curves, both the linear
component (F = 371.69, df = 1 and 76,
p < .001) and the quadratic component
(F = 4.85, df = 1 and 76, p < .05) are
significant. These latter results indicate
that the rate of learning is largely constant
with a small positive acceleration which is
probably not due to chance alone. Thus,
the process of learning the reversal principle
under the conditions of this study is
only slightly curvilinear. The between-group
means appears to be highly significant
(F = 12.19, df = 3 and 76, p < .001),
indicating that the average number of
correct responses over the Vincent quarters
differs among the four groups. When
the source of variation of the between- 
group means is partitioned into the two 
main and interaction effects, only the main 
effect of training (versus control) is signifi- 
cant (F = 34.73, df = 1 and 76, p < .001). 
This finding indicates, as may be seen in 
Fig. 2, that the control groups gave a larger 
number of correct responses than the ex- 
perimental groups. However, as pointed 
out earlier, the two control groups required 
significantly more blocks of trials to crite- 
rion than did the experimental groups. 
Hence the training treatments given the 
experimental groups, as compared with the 
control groups, resulted in dual effects 
during transfer: (a) fewer correct responses 
prior to awareness, and, at the same time, 
(b) fewer trials to reach the awareness 
criterion. The between-group trends is not 
significant (F = .46), indicating that the 
rates of learning among the groups are very 
similar. Partitioning of the source of varia- 
tion of the between-group trends into 


linear and quadratic components and their 
respective main and interaction effects 
showed no significant F values. However, 
a highly significant difference between in- 
dividual means (F = 4.01, df = 76 and 
228, p < .001) indicates that reliable meas- 
ures of individual differences were obtained. 

Can the large increase in performance 
during the transfer period for the two ex- 
perimental and two control groups prior to 
verbalized awareness be accounted for by 
partially correct verbalizations of the principle?
Of the 80 Ss, 69 gave partially correct
and 11 gave incorrect verbalizations on
that block of nine trials when they first
reached four correct responses. The probability
of obtaining 4/9 correct by chance
alone is .01 (Philbrick & Postman, 1955).
Furthermore, on that block of trials imme- 
diately preceding correct verbalization of 
the principle, 73 Ss gave partially correct 
and seven gave incorrect verbalizations of 
the principle. Therefore, it appears that 
the high level of performance under the 
conditions of the transfer period can largely 
be attributed to partially correct verbaliza- 
tions of the reversal principle. 
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Discussion 


It is, of course, apparent that an inter- 
pretation of the results is dependent upon 
one’s definition of awareness. A customary 
definition has been a correct verbalization 
by S about the reasons governing his re- 
sponses (Adams, 1957; Postman, 1955). 
Such a definition does not preclude im- 
provement resulting from partially correct 
verbalizations. If one is still content to 
accept the customary definition, then group 
Exp. Hom. during training showed some—
but in absolute amounts very little—LWA,
while during the transfer period the two 
experimental groups—and particularly the 
two control groups—showed strong evi- 
dence for LWA. 

If one does not agree with the usual defi- 
nition of awareness, but more stringently 
defines awareness as partially correct ver- 
balizations by S for his behavior, then 
group Exp. Het. during training and the 
four groups during transfer showed very 
little, if any, LWA. However, if awareness, 
in this study, is redefined as partially correct
verbalizations of the principle and
“partially correct” is defined as number
responses by S which are related to the 
number of letters in stimulus words, this 
experiment still provides substantial evi- 
dence for LWA. Witness the results indi- 
cating that most Ss in group Exp. Hom. 
demonstrated a high level of performance 
and yet offered incorrect verbalization of 
the principle. Hirsch (1957) has also found 
that partially correct verbalizations could 
not sufficiently explain his evidence for 
LWA. If this is the case, then there must 
be some validity in Thorndike’s (1935, p. 
40) hypothesis that the consequences of 
different S-R associations can operate to 
increase performance without S understanding
what he is learning. Yet, depending
upon the experimental conditions, it
appears that learning to learn a principle 
may occur, in part, without awareness, 
followed by partial awareness and then 
verbalized awareness. 
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Finally, for educational theory and prac- 
tice two points could be mentioned. First, 
Estes (1956) has pointed out that research 
on LWA does not neglect the importance 
of those variables which “lead the S to
verbalize relationships.” However, it should
be emphasized that for education it is par- 
ticularly important to identify those vari- 
ables or conditions which foster LWA. As 
a result, educators could possibly better 
define the conditions which maximize 
awareness (verbalization) or understanding 
of relationships. The question then arises 
whether or not the problem of LWA versus 
learning with awareness in psychology is 
similar to the old problem of mechanical 
versus meaningful learning in education. 

Second, research on learning sets or 
learning how to learn, employing preschool 
children, may indicate whether or not one’s 
facility to learn could be greatly increased 
by early and extended learning experiences. 
If confirmed, then such learning experience 
would be contrary to the theory of “waiting
for nature to produce maturation before
introducing topics” (Cronbach, 1950, p.
237). Thus educational theory, practice, 
and research might re-emphasize learning 
how to learn rather than just learning 
when to learn. 


SUMMARY 


This investigation is concerned with the 
influence of two procedures for presenting 
stimulus material on learning a principle in 
training and the subsequent effect of this 
training on learning a reversal principle 
during the transfer period. The group in 
training which received a heterogeneous 
stimulus-word presentation showed only 
little evidence for LWA, while the group 
which received a homogeneous stimulus- 
word presentation gave substantial evi- 
dence for LWA. For the latter group, 
partially correct verbalizations for the 
principle could not account for this LWA. 

The group which received a homogeneous 
presentation inferred the training principle 
more readily than the group which received
a heterogeneous presentation. Yet, the two
procedures for presenting the stimulus 
material in training had no differential 
effect on learning the reversal principle 
during the transfer period. However, learn- 
ing to learn the training principle under 
either of the two presentation treatments 
did facilitate learning to learn the reversal 
principle. During the transfer period, the 
two experimental groups—and particularly 
the two control groups—evidenced a higher 
level of performance prior to awareness of 
the reversal principle. This high level of 
performance could be attributed largely to 
partially correct verbalizations of the 
principle. 
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VALUE DIFFERENCES BETWEEN PUBLIC AND PRIVATE SCHOOL 
GRADUATES 


W. CODY WILSON 


Harvard University 


McArthur (1955) raised the question of 
personality differences between upper- and 
middle-class adolescents and presented 
data in support of a hypothesis concerning 
such differences. The data presented were 
differences between public and private
secondary school graduates, who were fresh- 
men at Harvard College, on responses to 
Thematic Apperception Test pictures. 
These differences were predicted on the 
basis of general knowledge of social classes 
and on Kluckhohn’s (1953) ideas of domi- 
nant and variant value orientations. 

McArthur’s findings stimulate several
interesting questions. Are these same dif- 
ferences manifested on more direct meas- 
ures of values? Do the two groups of stu- 
dents differ on other dimensions of values? 
Are these differences in values due to the 
different experiences of the two groups in 
their secondary schools, or are they mere 
reflections of differences between the two 
groups in such background variables as 
socioeconomic status and religious orienta- 
tion of their families? Do these differences 
persist through four years’ experience in a 
liberal arts college, or have they
disappeared by the end of the senior year?

As a part of a larger study, further data 
pertinent to the question of differences 
between public and private school gradu- 
ates were collected. These data do not 
answer all the questions raised by
McArthur’s findings; they do, however,
provide tentative answers to some of the
questions and help to rephrase and sharpen 
some of the others. 

PROCEDURE 

The data reported here are from an extensive
questionnaire answered by 165 seniors at
Harvard College during the middle of the
academic year 1956-57. The
questionnaire included a number of items 
inquiring into antecedent background char- 
acteristics of the students and four sets of 
items reflecting various value dimensions. 
These latter sets of items were: (a) 16 
items on occupational and work values 
which were an adaptation and extension of 
items developed by Centers (1949); (b) 20 
items measuring the four dimensions ex- 
tracted by Bales and Couch (1956) in a 
factor analysis of a large domain of values 
covered by the Value Profile Test; (c) 12 
items reflecting the different modes of 
answering value orientation questions posed 
by Kluckhohn (1953); and (d) an item re- 
flecting academic achievement. 

The sample was composed of 88 public 
school graduates and 77 private school 
graduates who were seniors at Harvard 
College. There were no statistically signifi- 
cant differences between the two groups in 
terms of patterns of geographical origin, 
family religious orientation, college resi- 
dence, extracurricular activities, scholastic 
aptitude, or academic area of concentra- 
tion. The two groups did differ in terms of 
fathers’ occupations and family incomes, 
with the private school group containing 
more students whose fathers were in higher 
status occupations and whose family in- 
come was more than $7000 a year (p less 
than .001 in both cases). 

RESULTS 

The two groups differed, at the .05 level
of confidence, on 6 of 16 items concerned
with occupational values (see Table 1).
More public school graduates valued the
opportunities to exercise a particular
competence, to make a contribution to society
at large, and to add to the accumulating
body of knowledge and culture; and more
private school graduates valued the
opportunity to make decisions and give orders,
to assume responsibility, and to be in close
contact with others.

TABLE 1

PROPORTION OF PUBLIC AND PRIVATE SCHOOL GRADUATES REPORTING CERTAIN
OCCUPATIONAL VALUES TO BE IMPORTANT IN THEIR CHOICE OF A CAREER

                                                     Proportion Reporting
                                                     Value to be Important
                                                     Public     Private     Probability
Value                                                (N = 88)   (N = 77)    of Difference

To make decisions and give orders                       23         37          .05
To assume responsibility                                55         70          .05
To be in close contact with people                      50         66          .05
To exercise a particular competence                     50         29          .01
To make a contribution to the society at large          75         59          .05
To add to the accumulating body of knowledge
  and culture                                        [one entry illegible; 42]  .05
To exercise intelligence in the solving of problems     72         60          .10
To express your own personality                         60         48          .12
To work as an individual                                10         19          .13
To be a leader                                          40         48          .25
To have varied and interesting experiences              7[?]       78          .25
To be looked up to by others                            30         37          .25
To be assured of a steady income and permanent
  position                                              [?]7       29          .25
To make a good deal of money                            20         25          .25
To be of personal service to others                     54         60          .25
To be independent and your own boss                     36         30          .25
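
As a rough check on the probabilities reported in Table 1, a two-group comparison of proportions can be run as a 2 x 2 chi-square test. The sketch below is illustrative only: the article does not state which test was used, the function name is hypothetical, and the cell counts are assumed by rounding the reported percentages against the group sizes (88 and 77).

```python
import math


def chi_square_2x2(yes_a, n_a, yes_b, n_b):
    """Pearson chi-square (1 df, no continuity correction) for two proportions."""
    a, b = yes_a, n_a - yes_a          # group A: endorse / do not endorse
    c, d = yes_b, n_b - yes_b          # group B: endorse / do not endorse
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variate with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# "To exercise a particular competence": 50% of 88 public, 29% of 77 private.
public_yes = round(0.50 * 88)    # assumed count: 44
private_yes = round(0.29 * 77)   # assumed count: 22
chi2, p = chi_square_2x2(public_yes, 88, private_yes, 77)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Under these assumed counts the chi-square is a little under 8 and the probability falls below .01, which agrees with the tabled entry for this item.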

In terms of Kluckhohn’s (1953) value 
orientations, the two groups differed, at 
the .05 level of confidence, in three of the 
four areas (see Table 2). The public school 
graduates were more Doing, Individualis- 
tically, and Man-over-nature oriented; and 
there were no statistically significant dif- 
ferences in Time orientation. 

Public and private school boys differed, 
at the .05 level of confidence, on two of 
Bales and Couch’s (1956) four factors (see 
Table 3). The public school graduates were 
higher on Equalitarian Ideology and In- 
dividual Orientation; there were no differ- 
ences between the two groups in terms of 
Acceptance of Authority and Need-deter- 
mined Assertiveness. 

Public school graduates had higher average
grades in college (see Table 4). The
public school graduates were heavily
overrepresented in the high honors (A)
category, and the private school graduates were
heavily overrepresented in the satisfactory
(C) category.


DISCUSSION 


It is now possible to answer some of the 
questions raised in the earlier discussion of 
McArthur’s (1955) findings. First, are dif- 
ferences between public and private school 
boys in terms of Kluckhohn’s (1953) Value 
Orientations (which were inferred by Mc- 
Arthur from responses to a projective test) 
manifested in a more direct measurement 
situation? The answer is a qualified yes. It 
is not possible, because of the nature of the 
two sets of data, to make a point by point 
comparison between the findings of
McArthur and those reported here. In general,
however, the two studies show the same 
pattern of differences; the public school
graduates exemplifying traditional American
value orientations and the private school
graduates emphasizing these values to a
lesser extent and consequently displaying a
somewhat variant orientation.

A second question was, do the two groups
of students differ on other dimensions of
values? Again the answer is a qualified yes.
The data reported here covered three
additional value areas, and the public and
private school graduates differed in each of
these areas. The qualification is concerned
with the question of independence of the
four value areas. It is quite obvious, for
example, that the value domains of Kluckhohn
and of Bales and Couch overlap; both
formulations contain a dimension of
“Individualism.” The extent of overlap among
the four value areas has not been determined,
but it seems reasonable, from the
different formulations of the items and the
different content in them, to assume that in
general they do cover different aspects of
the total possible value domain. The data
presented in this study do not completely
answer the question of the extent of value
differences between public and private
school graduates, but they do indicate that
the differences extend beyond the areas
covered by Kluckhohn’s formulations and
McArthur’s data.


TABLE 2

RESPONSES OF PRIVATE AND PUBLIC SCHOOL GRADUATES TO ITEMS REFLECTING
KLUCKHOHN’S (1953) VALUE ORIENTATIONS

Value Orientation                      Number Agreeing With Statement

Man-Nature Relationships
  Man-over-nature: The forces of nature must be
    overcome and put to the use of human beings;
    it is a part of man’s duty to overcome nature’s
    obstacles
  Man-in-harmony-with-nature: There is no real
    separation between man and nature; the two
    are essentially in harmony
  Man-subject-to-nature: Man’s destiny is subject
    to the whim of nature
Relational
  Individual: Individual goals should always take
    precedence over the goals of the family and
    other groups
  Collateral: A man should belong to some group
    whose roles and goals are more important than
    all other goals for him
  Lineal: The goals of one’s family, and especially
    its continuance through generations, are the
    most important goals a person can have
Activity
  Doing: A man’s life becomes meaningful in terms
    of his accomplishments; that is, he should be
    judged by what he does
  Being-in-becoming: The goal of life is the
    development of all aspects of the self as an
    integrated whole
  Being: For a mature personality the preferred
    mode of behavior is the spontaneous expression
    of one’s innate self
Time
  Future: A person should face toward the future,
    for there lies the fulfillment of life
  Past: The primary emphasis of life should be
    upon the restoration and maintenance of the
    traditions of the past
  Present: One should live primarily in the present

[The counts are illegible in this copy. The only surviving numeric
entries, whose column heading is lost, are: Doing 4.2; Being-in-becoming 2.4;
Being 0.1; Future 1.4; Past 2.0; Present 0.0.]


TABLE 3
MEDIAN SCORES OF PUBLIC AND PRIVATE SCHOOL
GRADUATES ON FOUR VALUE SCALES

                                      Median Scores
                                   Public     Private     Chi Square
Scale                              (N = 88)   (N = 77)    Median Test

Acceptance of Authority
Need-determined Assertiveness
  (Value-determined restraint)
Equalitarian Ideology
Individual Orientation

[Most entries are illegible in this copy. The surviving values, whose
row and column positions are lost, are 11.0, 10.9, 15.6, and 15.9.]


TABLE 4
GRADE AVERAGES OF PUBLIC AND
PRIVATE SCHOOL GRADUATES

Group                Number in Each Grade Category

Public (N = 88)      [illegible]
Private (N = 77)     [illegible]

χ² = 24.9; p < .001.


A third question raised by McArthur’s
findings was, are the differences in values
between public and private school graduates
a simple reflection of differences between
the two groups in background variables
such as socioeconomic status and religious
orientation of their families? It was
reported earlier that there were no statis- 
tically significant differences between the 
two groups in terms of religious orientation 
of family and geographical origin; therefore, 
the differences cannot be attributed to these 
antecedents. Public and private school
boys did differ, on the other hand, in terms 
of fathers’ occupation and family income. 
The difference in value between these two 
groups is not, however, a simple reflection 
of these differences in fathers’ occupation 
and family income. When the total sample 
was divided into new subsamples on the 
basis of, first, father’s occupation and, sec- 
ond, family income, the value differences 
described previously did not occur between 
the new subsamples, but new value
differences arose. For example, there were no
differences between students whose families 
have an income of above $7000 a year and 
those whose families have an income less 
than $7000 in terms of the Bales and Couch 
factors Equalitarian Ideology and In- 
dividual Orientation as there had been be- 
tween public and private school graduates, 
but there was a difference between the two 
groups on the dimension of Need-deter- 
mined Assertiveness. Similar results were 
found on the other value measures used. It 
must be concluded, then, that the differences
between public and private school
graduates are not to be explained as a simple
reflection of differences between these two
groups in terms of the background variables
Religion, Geographical Origin, Fathers’
Occupation, and Family Income. It is not
possible, of course, with the data available, 
to answer the further question of whether 
the differential values are developed in the 
schools or as a result of a more subtle se- 
lective process operating in the family’s 
choice of a public or private school for the 
son. 

A fourth question raised earlier was, do 
differences in values between public and 
private school graduates persist through 
four years’ experience in a liberal arts col- 
lege? The answer is yes. The Ss of this 
study were in the middle of their senior 
year in college, yet differences in values be- 
tween the two groups were found which 
were similar to the differences found be- 
tween two such groups during a freshman 
year in college. It was reported earlier that 
there were no significant differences
between the two groups in terms of college
residence, extracurricular activities, and 
academic areas of concentration; the boys 
participated in the same general social and 
intellectual environment in college. An
explanation of this persistence must be sought
in some more subtle influence.

CONCLUSION

At least four alternative explanations of 
this phenomenon may be suggested: (a) 
the college experience does not have much 
effect upon student values (cf. Jacobs 1927);
(b) differences at the end of the secondary 
school experience are so large that vestiges 
of them remain after three and one-half 
years of common college experiences; (c) 
there are informal selective social systems 
within a college environment which tend 
to maintain and enhance cultural differ- 
ences; or (d) a combination of the previous 
three influences may be operating. Infor- 
mation which would enable one to choose 
among these alternatives is not now avail- 
able. 

An interpretation of these differences on 
the manifest level is not difficult. The public 
school boys are characterized by stronger
orientation in the directions of Doing, Man-
over-nature, Individualism, Equalitarian- 
ism, and Achievement—almost a listing of 
typical or dominant American values. The 
private school boys, on the other hand, re- 
flect weaker orientations in these directions 
but reflect no clear-cut alternative values. 
But, are these existing differences the sur- 
viving vestiges of a disappearing class 
differentiation, or do they hail the subse- 
quent decline of the traditional American 
value system and the emergence of a new 
ideal with the Eastern preparatory school 
graduates as an avant garde (cf. Riesman, 
1950)? For example, the occupational 
values of the private school boys would 
seem to be particularly suited to the or- 
ganizational executive role, which is a 
relatively new development on the occupa- 
tional scene, while the values of the public 
school boys seem more consonant with the 
traditional professional roles. And, indeed, 
the occupational choices of 930 Harvard 
seniors support this hypothesis (data col- 
lected in the larger study): the private 
school graduates tend to choose bureau- 
cratic business, and the public school boys 
tend to choose science, medicine, and col- 
lege teaching; the quantitative data on law 
is ambiguous, but a closer inspection of the 
data shows that the private school boys 
tend to think of law as a preparation for an 
executive role in business or other organiza- 
tions, and the public school boys are in- 
terested in the practice of law itself (chi 
square with two degrees of freedom equals 
24.9, p less than .001). 

A question naturally arises concerning 
the generality of these findings beyond the 
population from which the sample was 
drawn. Any comments, until empirical 
data are available, must be purely specu- 
lative. On the basis of the fact that the 
value differences were not a simple reflec- 
tion of differences in certain background 
characteristics, other than type of second- 
ary school attended, the working hypothesis 
must be that these differences between 
public and private secondary school
graduates will also be found in other populations.
On the other hand, the private school is not 
as widespread a phenomenon in the South, 
Midwest, and West as it is in the North- 
east, and it may well be the case that the 
private schools in these other parts of the 
country serve different motives and func- 
tions. In such a case the working hypothesis 
would have to be revised. 


SUMMARY 


Responses to a set of value items by 88 
public school graduates and 77 private 
school graduates who were seniors in a 
large liberal arts college were compared. 
The public school boys were found to be 
more Doing, Man-over-nature, Individualistic,
Equalitarian, and Achievement
oriented; more public school boys valued 
the opportunity to exercise a particular 
competence, to make a contribution to 
society, and to add to the accumulating 
body of knowledge and culture; and they 
valued less the opportunity to make deci- 
sions and give orders, to assume responsi- 
bility, and to be in close contact with 
others. These differences were not a simple
reflection of antecedent background
characteristics, such as religious orientation or
socioeconomic status, but they are similar 
to differences found between other groups 
of public and private school graduates as 
college freshmen. The implications of these
differences for a changing American
character are discussed.


SOME LIMITATIONS OF TEACHER RATINGS AS
PREDICTORS OF CREATIVITY¹


JOHN L. HOLLAND

The assessment of students by means of
teacher ratings is an extensive practice
which has a number of influences: these
evaluations may affect the student’s going 
to college, obtaining a scholarship, having 
a positive self-concept, or feeling that he is 
capable of high achievement and creativity. 
Thus it is important that we acquire a 
more adequate knowledge of these ratings 
so that they can be made more valuable 
for both the student and the teacher who 
spends valuable time making these evalua- 
tions. 

Like most ratings, teacher ratings ex- 
hibit “halo” effect and have ambiguous 
validities. Since technical efforts to develop 
independent scales and to train raters have 
not proved generally successful, another 
approach seems in order. Ryan (1958) has 
suggested that it may be wiser to explore 
the meaning of ratings through their em- 
pirical correlates, using a wide range of 
information, than to hope “for literal 
measures of traits.” In other words, the 
objective is not to obtain ratings of well- 
defined student behaviors, but to explore 
the factors which influence the teacher 
to rate students favorably or unfavorably. 

This study attempts such exploration. 
Ratings of high school seniors made by their 
teachers and principals have been corre- 
lated with a variety of personality and 
achievement inventories, scholastic apti- 
tudes, student reports of their activities 
and avocational interests, and demographic 
information. 

1 This study was partially supported by 
the National Science Foundation and the 
Old Dominion Foundation. The author is 
indebted to Donald L. Thistlethwaite and 
Laura Kent for their constructive reviews 
of this paper. 


PROCEDURE 


The students rated in this study (783 
boys and 394 girls) are 83% of a one-sixth 
random sample drawn from 7500 students 
who were the survivors (finalists) in the 
1958 National Merit Scholarship program. 
The Sixteen Personality Factor Question- 
naire (16 P.F.), the National Merit Stu- 
dent Survey (NMSS), and the Vocational 
Preference Inventory (VPI) were adminis- 
tered by mail to this sample. 

Form A of Cattell’s 16 P.F. test was
used. This inventory is well known and has 
been described in a number of publications 
(Cattell, 1957; Cattell, Saunders, & Stice, 
1957). The NMSS is an experimental 
achievement inventory devised by the Na- 
tional Merit staff from a review of the liter- 
ature. It consists of 10 internally consistent 
scales assumed to measure some of the more 
important personality and attitudinal 
variables which are related to academic 
achievement. These include: Dedication to
Scholarship, Dependency, Dominance,
Play, Intellectualism, Introversion, Parental
Press, Persistence, Super Ego, and
Tolerance for Ambiguity. The VPI, an
experimental personality inventory composed
of occupational titles, is a revision of
the Holland Vocational Preference Inventory
which has been described elsewhere
(Holland, 1958). The VPI consists of the
following scales: Acquiescence, Infrequency,
Physical Activity, Intellectuality,
Responsibility, Conformity, Verbal Activity,
Emotionality, Aggressiveness, Control,
Masculinity, and Status. In addition
to these test data, the students listed their
activities, interests, and various background
data.


Teachers and principals filled out part
of an extensive information blank including
12 graphic rating scales. The rating scales, 
which covered emotional, intellectual, and 
physical traits and aptitudes, were divided 
into six intervals: Among best .01, .05, .10, 
.25, Average, Below Average, or No Ob- 
servation. School personnel were instructed 
to “rate this student relative to the high 
school seniors you have known during the 
past five years.” In general, the ratings were 
made by teachers, principals, or guidance 
workers, but there is no evidence to indi- 
cate which of these groups did most of the 
ratings or how much they collaborated in 
making evaluations. 

RESULTS

The rating scales were intercorrelated to 
test their independence. The average inter- 
correlation among the 12 scales is .64 and 
.59 for the male and female samples re- 
spectively. The average intercorrelation 
for each of the 12 scales against the 11 re- 
maining scales ranges from .60 to .70 for 
males, and from .51 to .65 for females. It is 
clear that the ratings are closely related 
despite the diversity of the personal quali- 
ties rated. Table 1 shows the intercorrela- 
tions among the rating scales. 
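
The two summary figures used above (the overall average intercorrelation, and each scale's average correlation with the remaining scales) can be computed directly from a correlation matrix. The following is a minimal sketch with a hypothetical three-scale matrix; the values are invented for illustration and are not taken from Table 1, and the function name is an assumption.

```python
def average_intercorrelations(matrix):
    """Return (per-scale averages, overall average) of off-diagonal correlations.

    `matrix` is a square, symmetric list of lists with 1.0 on the diagonal.
    """
    k = len(matrix)
    # Average correlation of each scale with the other k - 1 scales.
    per_scale = [
        sum(matrix[i][j] for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]
    # For a symmetric matrix this equals the mean of all off-diagonal entries.
    overall = sum(per_scale) / k
    return per_scale, overall


# Hypothetical 3-scale matrix, for illustration only.
r = [
    [1.0, 0.6, 0.7],
    [0.6, 1.0, 0.5],
    [0.7, 0.5, 1.0],
]
per_scale, overall = average_intercorrelations(r)
print([round(x, 2) for x in per_scale], round(overall, 2))
```

With 12 scales, the same function would return 12 per-scale averages (each over the 11 remaining scales) and one overall average, which is the form of the figures reported above.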

In view of the high intercorrelations 
(which suggest that there is a strong halo 
effect), the variable with the highest aver- 
age intercorrelation for both boys and girls 
was selected as the rating most representa- 
tive of the 12 ratings. This rating, Maturity, 
can be regarded as a measure of the degree
to which students are rated “high” or
“low,” and perhaps held in “high” or
“low” esteem by school personnel.

High versus low ratings of Maturity
were then correlated with student scores
for each of the following inventories,
scholastic aptitudes, and background variables:
Sixteen Personality Factor Questionnaire,
Vocational Preference Inventory,
National Merit Student Survey, high school
rank, father’s and mother’s educational
level, number of student offices (elected),
and Scholastic Aptitude Test. The correlations,
which were computed using the
Davidoff and Goheen method for estimat- 
ing tetrachoric correlations, are shown in 
Table 2. Twenty-four per cent of the cor- 
relations in Table 2 are significant beyond 
the .05 level so that the results probably 
cannot be attributed to chance. 
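
A tetrachoric correlation is estimated from a 2 x 2 table formed by dichotomizing both variables (here, high versus low rating against high versus low score). The Davidoff and Goheen procedure itself is not reproduced here; the sketch below substitutes the familiar cosine-pi approximation as a stand-in, and the cell counts and function name are hypothetical.

```python
import math


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric r for a 2 x 2 table.

    a and d are the concordant cells (high-high, low-low);
    b and c are the discordant cells (high-low, low-high).
    """
    if a * d == 0 and b * c == 0:
        raise ValueError("table is degenerate: both cross-products are zero")
    if b * c == 0:
        return 1.0    # no discordant pairs in one direction
    if a * d == 0:
        return -1.0   # no concordant pairs in one direction
    ratio = (a * d) / (b * c)   # cross-product (odds) ratio
    return math.cos(math.pi / (1 + math.sqrt(ratio)))


# Hypothetical counts: 40 high-high, 10 high-low, 10 low-high, 40 low-low.
r_tet = tetrachoric_cos_pi(40, 10, 10, 40)
print(round(r_tet, 3))
```

The approximation is usually considered adequate when both variables are split near their medians; with very uneven splits it can depart noticeably from the exact tetrachoric coefficient.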

To facilitate comprehension of Table 2, 
the student characteristics are ordered be- 
low according to their absolute correla- 
tions with Maturity. All student character- 
istics which have at least one significant 
correlation with the criterion are included. 
The first correlation in parentheses refers 
to the male sample; the second refers to the 
female sample. These correlations are 
probably attenuated due to restriction in 
range: the average aptitude level and high 
school rank for these samples are about two 
standard deviations above the national 
norms; the samples are also restricted in a 
variety of background variables, including 
parents’ education, family income, etc.
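
The size of the attenuation mentioned above can be illustrated with the standard correction for direct restriction in range (often called Thorndike's Case 2). This is a sketch under stated assumptions: the article reports no standard deviations, so the ratio used below is hypothetical, the formula is derived for product-moment correlations rather than tetrachoric estimates, and the function name is invented.

```python
import math


def correct_for_range_restriction(r, sd_ratio):
    """Estimate the unrestricted correlation from a restricted one.

    r        : correlation observed in the restricted sample
    sd_ratio : unrestricted SD divided by restricted SD (>= 1 under restriction)
    """
    return (r * sd_ratio) / math.sqrt(1 - r**2 + (r**2) * (sd_ratio**2))


# Observed r = .35 (High School Rank); the SD ratio of 2.0 is hypothetical.
corrected = correct_for_range_restriction(0.35, 2.0)
print(round(corrected, 2))
```

If the standard deviation in the unselected population were twice that of the finalists, an observed correlation of .35 would correspond to roughly .60 in the unselected population; with a ratio of 1.0 the function returns the observed value unchanged.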


No.   Variable
39    High School Rank   (.35, .35)
24    Persistence        (.19, .34)
42    Elective offices   (.19, .23)
15    Control            (.10, .26)
20    Play               (-.15, -.20)
25    Super-ego          (.18, .13)
12    Insecure           (-.17, -.12)
21    Intellectualism    (.14, .15)
 7    Shy                (-.14, .14)
43    SAT-Verbal         (.11, .16)
16    Tense              (-.12, -.14)
 6    Persistent         (.06, .18)
19    Dominance          (.00, .21)
28    Infrequency        (-.01, .17)
14    Self-sufficient    (.12, -.05)
11    Sophisticated      (-.12, .04)

This summary suggests that the boy 
receiving high ratings is characterized by 
his high grades, persistence, frequent elec- 
tion to student offices, seriousness, re- 
sponsibility, feeling of security, intellec- 
tuality, sociability, high verbal aptitude, 
freedom from tension, self-sufficiency, and 
lack of sophistication. In contrast, the 
boy rated low is characterized by his poor 
grades, erratic effort, less frequent election 
to student offices, playfulness,
undependability, insecurity, nonintellectual
interests, unsociability, low verbal aptitude,
tenseness, dependence, and sophistication. The
characteristics of girls rated high are much
like those found for boys; 12 of the 16
correlates have similar relationships with
ratings for both sexes.


TABLE 1
THE CORRELATIONAL MATRICES FOR TWELVE TEACHER RATINGS
FOR 783 BOYS AND 394 GIRLS

Trait
 1. Stability
 2. Maturity
 3. Originality
 4. Drive to achieve
 5. Dependability
 6. Speaking skills
 7. Writing skills
 8. Citizenship
 9. Popularity
10. Social leadership
11. Intellectual leader
12. Physical vigor

[The correlation entries are illegible in this copy.]

Note. Correlations for girls are above the diagonal; correlations for boys are below the diagonal. r is the average
intercorrelation for each variable against the other 11 variables.


TABLE 2

THE RELATION OF TEACHER RATINGS (MATURITY) TO STUDENT PERSONALITY, ACHIEVEMENT,
AND BACKGROUND VARIABLES

16 P.F.
 1. A Sociable
 2. B Intelligent
 3. C Mature
 4. E Dominant
 5. F Cheerful
 6. G Persistent
 7. H Adventurous
 8. I Effeminate
 9. L Paranoid
10. M Introverted
11. N Shrewd
12. O Insecure
13. Q1 Radical
14. Q2 Self-Sufficient
15. Q3 Controlled
16. Q4 Tense
NMSS
17. Scholarship
18. Dependency
19. Dominance
20. Play
21. Intellectualism
22. Introversion
23. Parental Press
24. Persistence
25. Super Ego
26. Tolerance for Ambiguity
VPI
27. Acquiescence
28. Infrequency
29. Physical Activity
30. Intellectuality
31. Responsibility
32. Conformity
33. Verbal Activity
34. Emotionality
35. Aggressiveness
36. Control
37. Masculinity
38. Status
Misc. Variables
39. High School Rank (Grades)
40. Father’s Education
41. Mother’s Education
42. Elective Offices
43. SAT-Verbal
44. SAT-Math

[The Boys and Girls correlation columns are largely illegible in this copy.
Fully legible entries (boys, girls): 24. Persistence .19**, .34**;
25. Super Ego .18**, .13; 26. Tolerance for Ambiguity .09, .03.]

Note. For boys, r.05 = .11, r.01 = .15. For girls, r.05 = .16, r.01 = .21.
* = .05 level of confidence.
** = .01 level of confidence.

The school and community activities of 
the student samples were also related to 
teacher ratings by categorizing student ac- 
tivities and teacher ratings in 2 x k tables 
and testing their significance. For boys, a 
number of school and community activities 
are significantly associated with ratings. 
Boys with high ratings say that their most 
important extracurricular interests are in 
student government and community ser- 
vice. Boys receiving low ratings are less 
active in these areas and more interested 
in athletics. The results for girls are not 
statistically significant. 

In short, the student rated high by his 
teachers appears to be a bright, persistent, 
conscientious academic achiever and stu- 
dent leader. His personal adjustment is 
characterized by self-control, sense of se- 
curity, and freedom from anxiety. 

DISCUSSION 

In a similar study, Tallent (1956) finds 
that high teacher ratings of self-control for 
secondary school boys are positively cor- 
related with intelligence test scores. Tallent 
also suggests that a rating bias may favor 
students who, according to the subscales of 
the control rating schedule, are distin- 
guished by “ability to persevere at a task, 
carefulness and accuracy of work, tendency 
to think before acting, and . . . preference 
for serious conversation or study to sports 
or active games.” This interpretation
appears congruent with the present results
which show that high ratings are associated 
with persistence, self-control, and academic 
interests. 

The empirical correlates of these teacher 
ratings are particularly interesting when 
they are examined in relation to our present 


knowledge of achievement and creativity. 
In a series of researches based on Cattell’s 
16 P.F., Cattell (1955), Drevdahl (1956), 
and Drevdahl & Cattell (1958) compared 
creative and noncreative people in samples 
of college students, teachers, adminis- 
trators, scientists, artists, and writers. Their 
findings, which are very consistent for these 
diverse groups, characterize the creative 
person as intelligent, emotionally mature, 
dominant, adventurous, emotionally sensi- 
tive (feminine), introverted, radical, self- 
sufficient, tense, unsociable, depressive, 
less subject to group standards, and im- 
pulsive. Although five or six of these 13 
scale differences appear to be consistent 
with the characterization of the student 
rated high by teachers in the present study, 
most of Cattell and Drevdahl’s findings 
describe a person who is in many respects 
the opposite of the person teachers seem 
to prefer. In Cattell’s words: “. . . these are 
not characteristics of a pleasant personality, 
differing markedly from those shown for the 
successful salesman or the elected, popular 
leader. . . .” Similar evidence for this
interpretation has been obtained by Getzels
and Jackson (1958) in a study of “intelli- 
gent” versus “creative” high school stu- 
dents. They find that creative students are 
characterized by their use of stimulus-free, 
humorous, and playful themes in their 
fantasy productions. Moreover, these in- 
vestigators also report that teachers pre- 
ferred the intelligent to the creative stu- 
dents, although both groups are about equal 
in school performance (grades). 

In an extensive series of researches of 
effective work performance and creativity, 
Barron (1957), Woodworth (1958), and 
others describe the creative person in terms 
which seem to support some of the Cattell 
and Drevdahl findings, especially the im- 
pulsive, radical, and dominant qualities 
attributed to creative persons. 

These differences suggest, then, that the 
use of teacher ratings is an inefficient and 
perhaps inadequate method for selecting
potentially creative persons. Although high
teacher ratings have some correlates which 
are similar to a number of the characteris- 
tics associated with creativity, such as 
persistence, high academic aptitudes, and 
perhaps dominance, most of the correlates 
are more indicative of potential for leader- 
ship or academic achievement rather than 
creativity. 

Practically, these results suggest that 
colleges, scholarship sponsors, and other 
organizations interested in the selection of 
potentially creative persons should use 
teacher ratings in a selective fashion. For 
example, they might be used as estimates 
of persistence, dominance, and high aca- 
demic aptitudes in combination with se- 
lected test variables predictive of creative 
behavior. Only a limited reliance on teacher 
ratings as predictors of creativity appears 
desirable, however, since teachers seem to 
prefer students whose potential for creative 
activity, as measured by tests at least, is 
less than that of students rated low. 


SUMMARY 


The significance of teacher ratings has 
been explored by correlating a representative
teacher rating, Maturity, with a
variety of personality, achievement, and 
background variables for a large national 
sample of high ability high school seniors. 
When considered with reference to findings 
from other studies, the results suggest that 
teacher ratings are potentially more useful 
as predictors of academic achievement and 
leadership potential than as predictors of 
creativity. 


A STUDY OF WORDS INDICATING FREQUENCY 


DAVID R. STONE AND RICHARD T. JOHNSON 


Utah State University 


In past years attempts have been made 
to find dimensions of meaning in the field of 
semantics and psycholinguistics. Studies 
have been somewhat sporadic, but recent 
developments indicate a broadening in- 
terest. In an article on the psychophysics of 
semantics, Jones and Thurstone (1955) ap- 
plied the method of successive intervals to 
describe a continuum of meaning from 
“greatest dislike” to “greatest like.” Mosier 
(1941) used the same method to establish a 
continuum of meaning through a “favora- 
ble-neutral-unfavorable” series. Cliff (1959) 
also used this procedure with 9 adverbs and 
15 adjectives in determining a series of 
scale values from “extremely nice” to 
“decidedly bad.” Other work in this area 
may be found in Simpson (1944), who at- 
tempted to organize frequency words using 
percentages, Osgood, Suci, and Tannen- 
baum (1957), who have measured meaning 
change with a “semantic differential,” and 
Cohen, Dearnley, and Hansel (1958) in 
England. General background for the en- 
tire field appears in Miller’s (1954) article 
on psycholinguistics. 

The study reported here uses the suc- 
cessive interval approach as outlined by 
Edwards (1952, 1957) with frequency words 
varying from “none of the time” to “all of 
the time.” 

Frequency itself is a rather small di- 
mension of meaning, but commonly used. 
The abundance of such terms in educational 
measurement and the consequent confusion 
as to their meanings points out the need for 
study in this area. For example, to decide 
whether an educational practice should be 
advocated “usually” or “occasionally” can 
be of critical importance. Analysis of the 
function of such words in test questions in 
educational measurement seems to have 
been neglected. 

PROCEDURE 

Over a period of years, 48 terms having 
at least an element of frequency meaning 
were collected from test questions supplied 
for courses in educational psychology. The 
list included words which have implications 
concerning “certainty” and “time” as 
well as frequency. Words appear in this 
fashion in actual questions. For example, 
“was” may be used with a time implication 
as well as a frequency implication. 

Instructions to the subjects were as fol- 
lows: 

In the word list on the next page are words 
and phrases which are used to refer to fre- 
quency of occurrence. For each word or 
phrase, make a check mark to show what 
frequency the word or phrase would mean 
to you if you read it in a book; that is, how 
many times you think the word implies or 
indicates. 

Example: Suppose you read that an event 
happens “sometimes” or a “few times.” 
Mark the appropriate place for each term 
on one of the blanks on the line opposite 
the word. 


                None of     Half of     All of
Words           the time    the time    the time

Sometimes

Few Times

On the next page are words used to ex- 
press judgments of how often or how fre- 
quently something can occur. They range 
from “none of the time” to “all of the time” 
with “half of the time” in the center of the 
scale. 

Read through the entire scale before you 
begin to mark your choices. Work carefully, 
but do not study each word too long. Record 
your first impression of the meaning as you 
would interpret it if you read it. If a word 
seems to have more than one meaning, 
choose the meaning most important to you. 
If you feel it is impossible to rate the 
frequency of a particular word, leave it 
until you have done the other words, then 
come back and make the best estimate you 
can. 

Mark one space for each word. 


The subjects (Ss) were 158 students in 
general and educational psychology, in- 
cluding freshmen through seniors, regis- 
tered in the fall of 1958. The procedure gave 
them the full list of terms and suggested 
that they carefully read it through before 
marking any item. 

The effect of “anchor” words was dis- 
cussed by Bendig (1955), to the effect that 
rater reliability increases as a function of 
heterogeneity of the stimuli. The anchor 
words (phrases) used here, “none of the 
time” to “all of the time,” are, of course, 
very heterogeneous. A center phrase was 
also included. The merit of presenting all 
stimuli together, as was done here, or 
separately for these kinds of stimuli has not 
been settled. 

Ss were asked to use a single choice for 
each word and not to omit any item. This 
was done, since it was assumed that while 
some terms were inexact they were not 
without meaning. 

All responses, no matter how deviate, 
were included. The intent was to parallel a 
class situation which would include some 
carelessness, but also some occasional 
“rare” interpretations. Divergent respon- 
ses, it was discovered, sometimes have 
a very logical meaning to the student. For 
example, the very few subjects who marked 
“never” in the “all the time” category were 
included in the study, since in class discus- 
sion a student had said that to him “never 
is always never, so I marked it always.” 
Such discussion took place after the ques- 
tionnaires had been collected. 

The method of successive intervals as 
given by Edwards (1952, 1957) was ap- 
plied to the cumulative choices for each 
of the items. Scale values and discriminal 
dispersions were then calculated. Each item 
was also plotted on normal probability pa- 
per as a check for the calculations. 

The items on the questionnaire were pre- 
sented in alphabetical order. They are pre- 
sented here in order of magnitude along 
the continuum from “never” to “always.” 


RESULTS AND DISCUSSION 


The scale values and standard deviations 
for the items are given in Table 1, with 
starred values referring to one possible 
selection of terms for a nine-step con- 
tinuum. 

More often than not, the plotted scores 
were linear, thus supporting the general 
hypothesis of this method. Three kinds of 
variations in the data were noted: (a) words 
having extreme frequency meanings showed 
relatively higher dispersions, (b) some words 
displayed skewness, and (c) some displayed 
bimodality. Phrases at the ends of the scale 
showing relatively high dispersions were 
“always,” “cannot,” “hardly ever,” “does 
not,” and “is always.” Phrases showing 
some skewness were “more often than 
not,” “requires,” “now and then,” “sel- 
dom,” “sometimes,” and “unpredictable.” 
Bimodal terms were “a possibility exists,” 
“are,” “can,” and “will.” 

The average error or discrepancy score 
for the data was .06. This was determined 
by calculating the average difference be- 
tween the theoretical cumulative propor- 
tions and the actual empirical cumulative 
proportions for the items. This error of .06 is 
somewhat larger than the .02 to .03 usually 
reported. We assume that the broader 
scope given in the directions, and the de- 
liberate inclusion of words with frequency 
plus other meaning dimensions, accounts 
for this. 
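
As a rough illustration of the discrepancy score just described, the following Python sketch compares theoretical and empirical cumulative proportions for a single item. It assumes that judgments are normally distributed around the item's scale value; the function name, the category boundaries, and the observed proportions are hypothetical and are not data from this study.

```python
from statistics import NormalDist


def discrepancy(scale_value, sd, upper_bounds, observed_cum):
    """Mean absolute gap between model and observed cumulative proportions.

    scale_value, sd : scale value and discriminal dispersion of one item
    upper_bounds    : upper boundary of each rating category on the scale
    observed_cum    : observed cumulative proportion at each boundary
    """
    model = NormalDist(mu=scale_value, sigma=sd)
    theoretical = [model.cdf(b) for b in upper_bounds]
    gaps = [abs(t - o) for t, o in zip(theoretical, observed_cum)]
    return sum(gaps) / len(gaps)


# Hypothetical item: these boundaries and proportions are invented.
bounds = [1.0, 2.0, 3.0, 4.0]
observed = [0.05, 0.30, 0.72, 0.95]
score = discrepancy(2.5, 0.9, bounds, observed)
print(round(score, 3))
```

Averaging such item scores over all items would give a single figure comparable in kind to the .06 reported above, though the study's own computation may have differed in detail.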

The larger dispersion of words or phrases 
at the extremes is related to the methodol- 
ogy. Except for “more often than not,” 
and “requires,” skewness seems to be as- 
sociated with an arbitrary limit below the 
“as often as not,” the center category. Bi- 
modality indicates two meanings for a 
term. The instructions possibly limited this 
trend from appearing more, since the Ss 
were asked to choose only the most com- 
mon meaning of an item. 

TABLE 1 
SCALE VALUES AND STANDARD DEVIATIONS FOR ALL STIMULUS ITEMS 

     Item                  Scale     SD  |      Item                  Scale     SD
                           value         |                            value
 *1. never                  .03    .90   |  25. may                   2.94    .74
  2. was never              .15   1.12   |  26. was                   3.13   1.58
  3. cannot                 .17   1.80   |  27. appears to be         3.18    .77
  4. does not               .24   1.70   |  28. should                3.35    .80
  5. hardly ever            .81   2.42   | *29. more often than not   3.40    .68
 *6. almost never           .85    .50   |  30. requires              3.41   1.03
  7. hardly                 .91   1.32   |  31. probably              3.56    .87
  8. very seldom            .96    .50   |  32. can                   3.65   1.07
  9. seldom                1.12   1.20   |  33. characteristically    3.72    .82
 10. not often             1.32    .51   |  34. rather often          3.73    .55
*11. infrequently          1.34    .60   |  35. considerably          3.75    .63
 12. a possibility exists  1.57   1.31   |  36. frequently            3.78   1.03
 13. few                   1.74    .90   | *37. usually               3.83    .54
 14. now and then          1.82   1.08   |  38. generally             3.86    .99
 15. once in a while       1.86    .85   |  39. often                 3.91    .60
*16. occasionally          2.11    .70   |  40. many                  3.97    .59
 17. perhaps               2.12    .92   |  41. was usually           4.06   1.41
 18. possibly              2.21    .79   |  42. very often            4.35    .63
 19. sometimes             2.46    .99   | *43. practically always    4.40    .51
 20. has                   2.51   1.34   |  44. is                    4.84   1.22
 21. unpredictable         2.71    .60   |  45. will                  4.86   1.26
 22. is sometimes          2.78    .64   |  46. are                   4.94   1.20
*23. as often as not       2.87    .61   |  47. is always             5.19   1.61
*24. 50-50                 2.87    .41   | *48. always                5.22   2.91

Note.—Starred values refer to one possible selection of terms for nine-step continuum. 

The most important part of this study is 
the suggestion of the importance of fre- 
quency words as an area for further study. 
Obviously, frequency words are not in- 
tended to convey percentage of occurrence 
in our language. The principal value of work 
on frequency would seem to be in establish- 
ing more consistent relative meanings 
rather than a series of absolute meanings. 

Another consideration is the fact that 
frequency meaning occurs in the same word 
with other dimensions of meaning, such as 
time and uncertainty. This is simply a 
characteristic of the language, but can be 
minimized by careful selection of words. 
The use of forms of the word “is,” where 
frequency is unstable and often neglected 
or unstated in test questions, poses a par- 
ticular problem. 


Three kinds of application are suggested. 
First, a list of frequency words with their 
semantic variations could be compiled for 
reference. As an example, class discussion 
revealed that, in some cases, the value of 
associated ideas determines that “many” 
can mean “few.” Thus, when three pennies 
are lost for an adult, it is “few,” but to a 
child, it would be “many.” Some meanings 
depend on context and vocal emphasis, as 
in “can” and “will.” Second, the need for 
such classification could be shown by refer- 
ence to actual test questions. For example, 
in the True-False question, “Changes in 
IQ over a long term reflect an error of 
measurement rather than an actual change 
of mental development,” if the reader as- 
sumes the frequency “sometimes,” the 
question is true, but if he assumes “al- 
ways,” it is false. In the question, “An ac- 
curate estimate of typical performance is 
obtained by observing the pupil in a typical 
situation,” the student wonders whether 
“is” means “is always,” or “is sometimes,” 
or some other unstated frequency. For the 
question, “A pupil’s motivation may be 
so strong as to interfere with good mental 
test performance,” the word “may” neces- 
sitates a different frequency assumption 
than the unstated frequency of “is” in the 
previous question. Third, a logical analysis 
of intervals could be used as another means 
of clarifying and developing a frequency 
continuum, which would be more than 
a simple scaling of how certain Ss now 
respond to the terms. One might begin a 
program of finding logical differences in 
intervals by taking “never” as in the first 
interval. Then in a “mutually exclusive” 
logic, see if another term can share the 
category, or whether a new one must be 
made to accommodate the next logical di- 
mension. Note that when “more often than 
not” is introduced into a situation which 
has been previously restricted to “always,” 
“never,” “many,” and “few,” the quality 
of class discussion changes. This forms a 
useful basis for later discussion of statistical 
concepts of significance. 


SUMMARY 


A series of 48 frequency words or phrases 
was selected from classroom tests. A group 
of 158 subjects was directed to rate them on a 
scale from “none of the time” to “all of 
the time.” 

Scale values and dispersions were calcu- 
lated for each item, and they were arranged 
on a continuum of successive intervals. 

From the results of this study it is possi- 
ble to select items for use in classroom test 
questions which can more consistently 
represent relative meaning for concepts of 
frequency. 


Burke’s comprehensive review (1958) of 
the literature on the Raven test shows sev- 
eral studies relating the measure to the 
Stanford-Binet, the Wechsler-Bellevue, the 
WISC, and an occasional group test. Vir- 
tually no studies have appeared to examine 
the test’s relationship to specific abilities 
other than those measured by Wechsler 
subtests, or to assess its usefulness with 
young children. 

The review also casts some doubt on the 
assertion that the Raven is virtually a pure 
measure of g. Also, correlations with the 
Binet have generally been lower than that 
reported in the test manual (r = .65), but 
correlations with performance measures 
tend to be slightly higher than with verbal 
measures. 

The subjects of this study¹ were 98 
kindergarten children, 45 boys and 53 girls 
aged 5-1 to 6-1, selected to represent the 
urban population of the United States by 
parental occupation distributed on the 
Minnesota Scale for Parental Occupations. 
These children were given individually the 
Raven Progressive Matrices (1947), the 
SRA Primary Abilities Test, and the Good- 
enough Draw-a-Man Test by three trained 
and experienced examiners. All testing was 
accomplished within the period of one 
month. 

Means and standard deviations for the 
several tests appear in Table 1. Reference 
to test norms shows that this group is very 
close to typical performance for its age 
(5-6) on all measures. It is at the 6-year 
level on Raven's norms. Table 2 presents 
the product-moment intercorrelations of 
the measures. Correlations calculated sepa- 
rately by sex showed no marked and cer- 
tainly no consistent patterns of differences 
and are omitted. 

1 Appreciation is due Nolan Kearney, 
Assistant Superintendent of the St. Paul, 
Minnesota schools, who arranged for the 
testing. 

In the present study, there is only a weak 
relationship to any of Thurstone’s (1953) 
factors measured in kindergarten children. 
Indeed, the Thurstone factors intercorrelate 
for the most part more substantially than 
does the Raven with any of them. Refer- 
ence to the three available studies relating 
the Raven to the WISC shows marked dis- 
crepancies in results among elementary 
school children. There is some tendency for 
the Raven to correlate with Arithmetic, 
Block Design, and possibly Vocabulary 
more substantially than with Comprehen- 
sion, Coding, Mazes, and Object Assembly. 

The odd-even split-half reliability (cor- 
rected) of the Raven with kindergartners 
in the present study is .466. This low value 
would certainly attenuate the relationships 
observed in Table 2. 

In preliminary work for this study, it 
became apparent that for kindergartners 
some of the British expressions in Raven’s 
manual had to be modified slightly. Such 
changes, consistently applied, brought 
quicker understanding of the task. In ad- 
dition, the question, “Is that the right one 
to go in here,” used to direct attention and 
insure a more careful examination of the 
materials, actually seems to shake self- 
confidence, an effect which then persists 
throughout the measure. Probably the di- 
rections need more extensive revision. 

The test appears to be graduated in dif- 
ficulty and there was consequently con- 
siderable waning of interest and enthusi- 
asm, especially in the B series. At best, the 
test proved difficult for five- to six-year- 
olds, especially those in the average reaches 
of ability and below. 

TABLE 1 
MEANS AND STANDARD DEVIATIONS OF SCORES ON THE RAVEN, THE THURSTONE PMA, 
AND THE DRAW-A-MAN TESTS, KINDERGARTEN CHILDREN 

                        Boys (N = 45)     Girls (N = 53)     Total (N = 98)
                        X̄       SD       X̄       SD        X̄       SD


Raven 16.25 2.8 16. 
Primary Mental Abilities: | 
Verbal Meaning | @. 30.34 30. 
Perceptual Speed 9.36 | 11.04 10. 
Quantitative 11.11 11.2 
Motor 21.56 8.8 25.40 23 . 
Space 12.45 11.57 
D-a-M* 2.69 | 15.50 3 


* Values reported here are in terms of a restandardization and extension of the Goodenough test to be published 
shortly. The correlation of the old and new scales is .94 


TABLE 2 
INTERCORRELATIONS* OF SCORES ON THE 
RAVEN, PMA, AND DRAW-A-MAN TESTS, 
KINDERGARTEN CHILDREN 

                              Primary Mental Abilities
                    Raven   Verbal   Percep.   Quant.   Motor   Space
                            Meaning  Speed
Verbal Meaning
Perceptual Speed
Quantitative         .36     .70      .60
Motor                .22     .34      .42       .31
Space                .34     .57      .60       .59      .60
D-a-M                .22     .50      .44       .54      .40     .51

* r = .25 significant at .01 level 


REFERENCES 

Burke, H. R. Raven’s Progressive Matrices: A review and critical 
   evaluation. J. genet. Psychol., 1958, 93, 199-228. 
Minnesota scale for paternal occupations. Minneapolis: Univer. of 
   Minnesota, Inst. of Child Developm. & Welfare (undated). 
Raven, J. C. Guide to using progressive matrices (1947). New York: 
   Psychological Corp., 1947. 
Thurstone, Thelma G., & Thurstone, L. L. Examiner Manual for the SRA 
   Primary Mental Abilities (ages 5 to 7). Chicago: Science Research 
   Associates, 1953. 

Received March 13, 1959. 


INTERPRETATION OF RELIABILITY 
AND VALIDITY COEFFICIENTS: 


REMARKS ON A PAPER BY LORD 
LEE J. CRONBACH AND GOLDINE C. GLESER 


University of Illinois¹ 


It was formerly held that only test scores 
with high reliability and validity were 
practically useful. Taylor and Russell 
(1939) were the first of many writers to 
modify this viewpoint by pointing out con- 
ditions under which a test can make a sub- 
stantial contribution even though its valid- 
ity or reliability is low. We have reviewed 
many of these papers elsewhere (Cronbach 
and Gleser, 1957). 

Lord (1958) has recently contributed a 
valuable paper on the usefulness of un- 
reliable difference scores. Two comments 
are to be made about this paper. (a) Lord’s 
method of analysis is general and applies 
to all reliability and validity coefficients; 
his paper therefore has implications far 
beyond the interpretation of difference 
scores. (b) Modifying Lord’s evaluation 
procedure in one particular leads to an 
important change in the conclusions. For 
many decision makers Lord’s formulation 
is less suitable than the alternative anal- 
ysis, and his interpretation regarding the 
value of tests is insufficiently conservative. 

Most statements describing the useful- 
ness of tests as judged from their reliability 
or validity coefficients assume that a de- 
cision is made about every person tested. 
The Cooperative Test Division (1955) of 
ETS, in making recommendations to inter- 
preters of certain aptitude and achievement 
batteries, adopts the contrary position that 
decisions might better be made only about 
persons for whom the test provides de- 
pendable information. Bloom (1942) noted 
that even an unreliable test permits one to 


1 This study was aided by USPHS Grant 
M-1839. The comments of F. M. Lord on a 
draft of this paper are gratefully acknowl- 
edged. 


divide a group into a few broad categories 
with considerable confidence. In applying 
this concept to differences between sub- 
tests of the SCAT and STEP batteries, the 
CTD suggests that, where there is a large 
difference, a test permits an accurate in- 
ference that one true score is higher than 
another, even though the difference score 
has quite modest reliability. For persons 
with small observed differences, on the 
other hand, the CTD suggests that the 
best course of action is to make no differ- 
ential interpretation. Specifically, the fol- 
lowing rule is proposed: If a difference score 
is larger in absolute value than k, interpret 
it as a true difference; if it is less than k, 
act as if there is no difference, at least until 
further information about the person is 
taken into account. Considering the V and 
Q scores for SCAT, for example, this strat- 
egy calls for assigning the person to one of 
three groups: V > Q, V < Q, and no dif- 
ference established. 

STRATEGY WITH FIXED α RISK 

The value of k may be determined in 
many ways. The CTD proposal makes k 
proportional to the standard error of meas- 
urement of the difference score; specifically, 
k is set equal to √2 S.E._d. A paper by 
Mendenhall (1959) adopts a similar ap- 
proach, setting k equal to 1.96 S.E._d. We 
shall refer to a strategy which makes the 
cutting score a multiple of S.E. as a “fixed 
α” strategy, for reasons which will be 
made clear shortly. 

The most obvious virtue of the CTD rule 
is the convenience with which it may be 
applied. It is recommended for such bat- 
teries as SCAT and STEP that each score 
be plotted on the profile sheet not as a 
point, but as a band extending 1 S.E. 
above and below the observed score. A 
difference larger than √2 S.E. will be 
present only when the bands plotted do not 
overlap. The counselor is instructed to 
interpret only differences where the pupil’s 
V-score and Q-score bands show no over- 
lap (i.e., if V is plotted as a band 40—50 
and Q as a band 48—58, no interpretation 
is made). 
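
The band rule can be sketched in a few lines of Python. This is only an illustration: it assumes both scores carry the same standard error, treats bands that merely touch as overlapping, and uses a hypothetical function name.

```python
def bands_overlap(score_a, score_b, se):
    """True when the bands (score - se, score + se) around two scores overlap.

    Assumes the same standard error `se` for both scores; bands that
    just touch are counted as overlapping (a conservative choice).
    """
    return abs(score_a - score_b) <= 2 * se


# The example from the text: V plotted as 40-50 and Q as 48-58,
# i.e. observed scores of 45 and 53 with S.E. = 5.
print(bands_overlap(45, 53, 5))  # bands overlap, so no interpretation is made
```

With equal standard errors the standard error of the difference is √2 times that of either score, so non-overlapping bands correspond to a difference larger than √2 times the standard error of the difference.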

Lord’s paper is devoted to an evaluation 
of this strategy. He takes into account two 
consequences of the rule, as applied to a 
difference score with a specified reliability: 
the proportion of persons about whom dif- 
ferential interpretations are made (p), and 
the average risk (q_c) of making a differential 
interpretation when the true difference is in 
the opposite direction. For example, when 
r_dd = .42, and difference scores are con- 
verted to a scale with unit standard devia- 
tion, S.E._d = .76 and k = 1.07. Then 28% 
of the subjects have differences greater 
than 1.07, and in 90% of those cases the 
observed difference is in the same direction 
as the true difference. Hence, the average 
risk of an incorrect differential interpreta- 
tion is 10% when the CTD rule is applied 
to this test. 
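
The arithmetic of this example can be checked with a short Python sketch. It assumes observed difference scores are normally distributed with unit standard deviation; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ctd_cutoff(reliability):
    """Standard error, cutting score k, and proportion p under the CTD rule.

    Assumes unit-variance, normally distributed observed difference scores.
    """
    se = sqrt(1.0 - reliability)               # S.E. of the difference score
    k = sqrt(2.0) * se                         # CTD cutting score
    p = 2.0 * (1.0 - NormalDist().cdf(k))      # share with |difference| > k
    return se, k, p


se, k, p = ctd_cutoff(0.42)
print(round(se, 2), round(k, 2), round(p, 2))
```

For a reliability of .42 this gives values close to the .76, 1.07, and 28% quoted above.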

This argument (like Bloom’s) sets aside 
the conclusion of Bennett and Doppelt 
(1948) that the minimum acceptable reli- 
ability for a difference score is about .75. 
Their position was based on Kelley’s (1923) 
calculation of “the proportion of differences 
in excess of chance.” Since this proportion 
does not relate in any direct way to the 
goodness of decisions based on the differ- 
ence score, as a basis for evaluating a test 
it is much inferior to Lord’s which has a 
clear relation to the utility of decisions. 

Lord concludes that the CTD rule is an 
acceptable one, as the average risk is low 
even when the score has a reliability as low 
as .40. Indeed, he points out that if r_dd > 
.90, the average risk is extremely low, so 
that the counselor ignores differences which 
could very safely be interpreted. He implies 
that a better strategy would be to adjust k 
so as to maintain a fixed average risk, no 
matter what the score reliability. This may 
be referred to as a “fixed q_c” strategy, and 
the difference between fixed α and fixed 
q_c strategies may be explained with refer- 
ence to Fig. 1. 

In this sketch, x is any score which is to 
be interpreted, and y is the score with 
respect to which persons would ideally be 
classified (criterion score, true score, or 
true difference score). Persons are to be 
identified who may confidently be classified 
as having y > y′ or y < y′. In the problem 
of identifying nonzero differences, y′ = 0. 
Persons for whom x > k are classified as 
having y > y′. Under the fixed α strategy, 
k is placed on the x scale at a distance from 
zero determined by S.E. The standard 
error is the standard deviation of any hori- 
zontal array. The line x = k cuts off a 
certain proportion of persons in the array 
where y = 0, i.e., where the null hypothesis 
is true. We may refer to this proportion as 
α/2, recognizing that there are an equal 
number of cases where y = 0 and x < −k. 
Then α is the risk of incorrectly making a 
differential interpretation when the null 
hypothesis holds. Setting k equal to a fixed 
multiple of S.E. has the effect of holding 
α constant as r_xy varies. Specifically, when 
k = √2 S.E., α is fixed at .16. 

The average risk q_c, which Lord uses to 
evaluate the fixed α strategy, takes into 
account all cases for whom a decision is 
reached, i.e., all cases where x > k or x < 
−k. Those in areas B and C of Fig. 1 are 
erroneously interpreted. The proportion of 
decisions p reached is the sum of the vol- 
umes under the normal bivariate distribu- 
tion, p_A + p_B + p_C + p_D. The proportion 
of correct decisions p_c = (p_A + p_D)/p and 
the average risk q_c = (p_B + p_C)/p. 

[Fig. 1. Sketch of α risk and average risk q_c. Horizontal axis: observed score x, with cutting scores −k and k; vertical axis: true or criterion score y.] 

Lord suggests that the value of k might 
be adjusted, as r_xy changes, to keep q_c 
constant. This would require that k be a 
smaller multiple of S.E. as r increases. 
Though Lord appears to regard this fixed 
q_c strategy as superior to the fixed α 
strategy, he does not discuss it in detail, 
and we shall give it no further direct atten- 
tion. 

Our paper differs from Lord’s in placing 
emphasis upon the maximum risk of erro- 
neous interpretation, rather than upon the 
average risk. It is obvious that the risk of 
a wrong decision is greater for the person 
whose observed score is near the cutting 
point k than for the person with an extreme 
score. Looking at only the average risk, as 
Lord does, one may conclude that a pro- 
cedure is conservative even when appreci- 
able risks are taken in making decisions 
about persons near the borderline. The 
CTD rule proposes to interpret differences 
which, considered individually, are quite 
likely to be due to chance. Specifically, in 
the example considered above where r_dd = 
.42, Lord reports an average error rate of 
10%, but we find that for persons with 
differences near k the expected error rate is 
18%. Some users who would be quite pre- 
pared to accept 1 erroneous interpretation 
in 10 would not consider an error rate close 
to 1 in 5 as adequately conservative. 
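
The contrast between average and borderline risk can be reproduced numerically. The Python sketch below is an illustration under stated assumptions, not the authors' computation: observed differences x are taken as standard normal, and the true difference given x is taken as normal with mean r·x and standard deviation √(r(1 − r)), the usual true-score regression. Function names are hypothetical, and the reliability r must be below 1.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def risk_at(x, r):
    """Chance that the true difference has the opposite sign, given observed x."""
    return STD.cdf(-abs(x) * sqrt(r / (1.0 - r)))


def average_risk(r, k, upper=8.0, steps=4000):
    """Midpoint-rule average of risk_at over x > k, weighted by the density of x.

    By symmetry the lower tail (x < -k) gives the same average.
    """
    width = (upper - k) / steps
    num = den = 0.0
    for i in range(steps):
        x = k + (i + 0.5) * width
        weight = STD.pdf(x) * width
        num += weight * risk_at(x, r)
        den += weight
    return num / den


r = 0.42
k = sqrt(2.0 * (1.0 - r))          # CTD cutting score on the unit-variance scale
borderline = risk_at(k, r)         # risk for a person scoring exactly at k
average = average_risk(r, k)       # risk averaged over everyone beyond k
print(round(borderline, 2), round(average, 2))
```

Under these assumptions the borderline figure comes out near the 18% cited above and the average near Lord's 10%.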

Though average risk is sometimes an 
appropriate loss function to use in evalu- 
ating a strategy, maximum risk is more 
appropriate in other situations. Arbous and 
Sichel (1952) and Arbous (1952) have com- 
pared the two in discussing industrial 
selection. They point out that a test of low 
validity may profitably be used as the first 
stage in a sequential procedure for selecting 
employees, where unpromising applicants 
are ruled out and the remainder are given 
a further test. The benefit of this procedure 
to the institution (employer) may properly 
be judged in terms of the average quality 
of the men finally selected and the cost of 
testing per man hired. Such emphasis on 
average risk is not, however, appropriate 
from the individual’s viewpoint. Arbous 
and Sichel protect the interests of the in- 
dividual by fixing a cutting score x′ on the 
pretest such that the maximum risk φ of a 
false decision (rejection of a man who would 
pass the second test) is .001 or some other 
suitable value. Scores near the cutting 
point x′ can then be interpreted at a pre- 
determined level of confidence, and more 
extreme scores can be interpreted with 
even greater confidence. This is preferable 
to a strategy yielding a specified average 
risk whenever individual rather than in- 
stitutional decisions are being made, since 
the risk for any single individual is limited. 
As discussed by us elsewhere (1957, p. 6 ff.), 
an institutional decision is one of a series of 
decisions all of which contribute to the 
benefit of the same institution. An individ- 
ual decision is one intended to serve the 
interests of an individual; it recurs rarely 
or never, and consequently the individual 
cannot average his risks over many deci- 
sions. Decisions reached in counseling and 
guidance are individual decisions. 

Introducing the concept of maximum 
risk raises two questions: How great is the 
maximum risk under the fixed α strategy, 
and what light does this shed on the con- 
clusions of Lord and Mendenhall? What 
procedure would guarantee that the max- 
imum risk does not exceed a specified value, 
and how satisfactory is such a “maximum 
φ” strategy? We begin with the former 
question. 

For convenience, we shall express x in 
units such that x̄ = 0, s_x = 1, and r_xy > 0. 
The scale for y will be given by the data 
if y is a criterion score, but if y is a true 
score, ȳ = x̄ = 0 and s_y = √r_xx. 
Figure 1 is the scatter diagram relating 
x and y, however these may be defined. 
Then within any vertical array (x fixed) 
y is normally distributed with mean 
ȳ + r_xy s_y x and standard deviation 
s_y √(1 − r²_xy). The line y = y′ cuts this 
distribution at a point whose location, 
expressed as a normal deviate within the 
distribution, is m. By the usual transforma- 
tion, 

$$ m = \frac{y' - \bar{y} - r_{xy}\, s_y\, x}{s_y \sqrt{1 - r_{xy}^{2}}} \qquad [1] $$


Now φ is the proportion of the cases falling 
above or below m. The strategy under 
consideration will employ two cutoffs, as 
shown in Fig. 2: if x < x′, the person will 
be classified as having y < y′; if x > x″, 
as having y > y′. (Introducing x′ and x″, 
which need not be symmetric about zero, 
formulates the problem more generally 
than does the use of ±k as cutting scores.) 
Since we are concerned with the risk of 
misclassification, the proportion above m 
is taken as the risk φ when x < x′, and the 
proportion below m is used when x > x″. 

Among those arrays where x < x′, the 
risk φ is greatest when x = x′; similarly, 
among the arrays where x > x″, the risk 
is greatest when x = x″. Call these maxi- 
mum risks φ′ and φ″, respectively; the 
associated values of m may be referred to 
as m′ and m″. 

For any specified y′, x′ and x″, the maxi- 
mum risks may be determined by substi- 
tuting in [1] to obtain m′ and m″, and using 
the tabled normal distribution to find the 
area above m′ (call this φ′) and the area 
below m″ (φ″). In the CTD interpretation 
of difference scores, persons are to be clas- 
sified as above or below y′ = 0. The fixed 
α procedure sets x′ = −√(2(1 − r_xx)). 
Noting that r_xx = r²_xy, and ȳ = 0, and sub- 
stituting in [1], we find that m′ = √(2r_xx). 
Likewise m″ = −√(2r_xx). Over the range 
of values of r_dd, the CTD proposal fixing 
α at .16 and y′ at 0 leads to the following 
consequences: 
        Proportion of
        persons for         Average     Maximum
        whom decisions      risk        risk
r_dd    are made (p)        (q_c)       (φ′ = φ″)
.78 
10 
.37 14 
.20 if .26 
.00 50 
(The first three columns are similar to 
Lord’s.) 

When r_dd → 1, m′ → √2 and m″ → −√2, 
so that under the CTD procedure, the 
maximum risk is never less than .078. The 
maximum risk is of course greater than 
the average risk (save where r_dd = 0), and 
becomes several times as large as q_c when 
r_dd is large. 
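
The way the maximum risk behaves across reliabilities can be illustrated with a short Python sketch. It rests on the relation m″ = −√(2r) derived above and on normally distributed, unit-variance observed differences; the function name and the particular reliabilities chosen are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def ctd_consequences(r):
    """Proportion decided (p) and maximum risk under the CTD rule.

    r is the reliability of the difference score; the cutting score is
    sqrt(2 * (1 - r)) on a unit-variance scale of observed differences.
    """
    k = sqrt(2.0 * (1.0 - r))
    p = 2.0 * (1.0 - STD.cdf(k))           # share of persons with |x| > k
    max_risk = STD.cdf(-sqrt(2.0 * r))     # area below m'' = -sqrt(2r)
    return p, max_risk


# Illustrative reliabilities, not necessarily those tabled in the paper.
rows = {r: ctd_consequences(r) for r in (0.0, 0.2, 0.6, 1.0)}
for r, (p, max_risk) in rows.items():
    print(f"r = {r:.2f}   p = {p:.2f}   maximum risk = {max_risk:.3f}")
```

At a reliability of 1 the maximum risk reaches its floor of roughly .078, and at a reliability of 0 it rises to .50, in line with the limits stated in the text.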

The acceptable risk depends on the type 
of decision being made. In individual deci- 
sions (particularly counseling), it is gen- 
erally desirable to be conservative, seeking 
additional information rather than accept- 
ing a hazardous conclusion. When a ter- 
minal decision is under consideration, it 
appears reasonable to set the maximum 
risk at .10 or .05. An even lower level 
might be desired for an important decision 
that could not be reversed should it prove 
to be wrong in the light of later experience. 
On the other hand, some counseling inter- 
pretations are easily and cheaply reversed 
as more information comes to light (e.g., 
performance in verbal and mathematical 
courses may reverse an impression of
difference given by test scores). A risk of
.20 seems none too high for a tentative
decision where reversal costs little.

The problems considered to this point
are also pertinent to Mendenhall’s (1959)
paper. His discussion is in many ways like
that of Lord, representing an attempt by
a test-publishing organization to state how
useful is differential information from one
of its tests. Following conventional
statistical logic, Mendenhall calls an observed
difference “significant at the .05 level” if
it exceeds 1.96 S.E._d. He then judges the
utility of the test by calculating the propor- 
tion of cases expected to have “significant” 
differences. When r_dd = .81, for example,
he finds that 39% of the persons have
significant difference scores. He thus implies
that a cutoff at ±1.96 S.E. would permit
decisions to be made about 39% of the
cases, with 5% risk of misclassification.
This is a fixed α strategy, differing from
the CTD proposal in that 1.96 replaces
1.41 as a multiplier, so that α is .05
instead of .16. Mendenhall calculates, as does
Lord, the proportion p of cases for whom
a decision is reached. He fails, however, to
consider that the α risk is not an
indication of the dependability of the decisions
made, a matter with which the decision
maker is normally concerned. The table
above shows that both the average risk q_c
and the maximum risk φ may exceed α.
Moreover, we find that when r_dd = .81, we
can make decisions about only 26% of the
cases with a guaranteed φ risk no greater
than .05 (cf. Mendenhall’s 39% above).
The risk of interpreting a null difference
(α) is not the same as the risk φ of
misinterpreting (reversing) a difference. The
α risk answers the question: Given a
person for whom the true difference is zero,
how likely are we to interpret an observed
difference for him? The φ risk answers the
question: Given a person with a borderline
observed difference, how likely are we to
be incorrect in interpreting that difference?
Mendenhall’s analysis, though technically
accurate, by implication gives an unduly
favorable impression of the value of the
difference scores in question.

[Figure: true or criterion score y plotted against observed score x.]
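
The 39% figure attributed to Mendenhall can be reproduced from the fixed α rule alone. The following is a minimal sketch, assuming observed differences in standard-score units with standard error √(1 − r_dd). The function name is hypothetical and does not come from either paper.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()  # standard normal distribution

def proportion_significant(r_dd, alpha=0.05):
    """Share of persons whose observed difference exceeds the two-sided cutoff.

    Assumption: observed differences are standard scores, so the standard
    error of measurement is sqrt(1 - r_dd).
    """
    multiplier = ND.inv_cdf(1.0 - alpha / 2.0)   # about 1.96 when alpha = .05
    cutoff = multiplier * sqrt(1.0 - r_dd)
    return 2.0 * (1.0 - ND.cdf(cutoff))

print(round(proportion_significant(0.81), 2))              # Mendenhall's case
print(round(proportion_significant(0.81, alpha=0.16), 2))  # multiplier near 1.41
```

The calculation counts how many observed differences pass the cutoff. It says nothing about how often a difference that passes is interpreted in the wrong direction, which is the distinction the paragraph above draws between α and φ.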


Strategy With Fixed Maximum φ Risk


We turn now to the consideration of a
strategy designed to fix the maximum risk
φ’ and φ” at some stated level φ_m. It is
desired to use x scores to identify
individuals who are above and below some level
y’, with a maximum risk φ_m of an incorrect
identification. Two cutting scores on the x
scale are determined: a lower score x’ such
that when x = x’, P(y > y’) = φ_m, and
an upper score x” such that when x = x”,
P(y < y’) = φ_m (see Fig. 2). From
the given φ_m, m is determined from the
normal table. For the lower cutting score
x’, φ_m is the proportion of cases in the
upper tail, hence the sign of m is
positive. For the upper cutting score x”, φ_m is
the proportion of cases in the lower tail,
hence the sign of m is negative. Solving
[1] for x gives

    x = [y’ − m√(1 − r_xy²)] / r_xy        [2]

Tables 1 and 2 indicate the consequences
of applying such a maximum φ strategy at
various levels of reliability or validity.
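
The cutting scores for a maximum φ strategy can be sketched as follows. The code assumes that x and y are standard scores and that [1] has the usual regression form m = (y’ − r_xy·x)/√(1 − r_xy²). That form is an assumption here, although it does reproduce the cutoffs of 1.47 and .63 that survive in Table 2 for validity .95. All function and parameter names are hypothetical. The validity must be greater than zero.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()  # standard normal distribution

def cutting_scores(validity, y_crit=0.0, phi_m=0.10):
    """Return (lower, upper) cutting scores on x for maximum risk phi_m.

    Assumption: m = (y_crit - validity * x) / sqrt(1 - validity**2).
    """
    m = ND.inv_cdf(1.0 - phi_m)            # positive deviate, upper-tail area phi_m
    spread = sqrt(1.0 - validity ** 2)     # SD of y about its regression on x
    lower = (y_crit - m * spread) / validity
    upper = (y_crit + m * spread) / validity
    return lower, upper

def percent_decided(validity, y_crit=0.0, phi_m=0.10):
    """Percentage of persons falling outside the two cutting scores."""
    lower, upper = cutting_scores(validity, y_crit, phi_m)
    return 100.0 * (ND.cdf(lower) + 1.0 - ND.cdf(upper))

print(cutting_scores(0.95, y_crit=1.0))    # cutoffs for y' = +1 SD
print(round(percent_decided(0.80), 1))     # above or below average, validity .80
```

With validity .80 and y’ at the mean, about one-third of persons fall outside the cutoffs.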

In Table 1, persons are to be judged as 
above or below average. Though this table 
covers the problem Lord considered, of de- 
tecting positive and negative differences 
within a profile, it also applies to interpret- 
ing any other decision about standing rela- 
tive to the group mean. For high reliability 
(or validity) the results under this strategy 
are very similar to the results reported by 
Lord for the fixed α strategy, with regard
to both number of decisions made and aver- 
age risk. For low reliabilities, however, we 
find that the number of decisions that can be 
made with confidence is much lower than 
Lord’s report suggests. Even when r_dd = 0,
the CTD strategy allows decisions
regarding 16% of the subjects, but our strategy
allows none. The maximum φ strategy, on
the other hand, runs a greater risk of
overlooking a true difference than does the
fixed α strategy. It is of interest that
average risk q_c remains much more constant
over the possible range of reliabilities than
it does under the fixed α strategy. The
maximum φ strategy to some degree
overcomes the difficulty which led Lord to
suggest a fixed q_c strategy in place of the
fixed α procedure. Fixed q_c and maximum φ
strategies are by no means identical,
however. While an average risk q_c of .01-.07
corresponds to a maximum risk φ of .10,
further calculations indicate that q_c must
be set near .002 to guarantee that φ is no
larger than .05.

TABLE 1

Utility of Test for Classifying Persons as Above and Below Average with
Maximum Risk φ = .10

                          Percentage of Persons for
                          Whom Decision Is Reached
Validity   Reliability    Any Decision   Correct    Proportion      q_c     α
                          (p_a + p_b)    Decision   Correct (p_c)
                                         100.0      1.000           .000    1.00
                                         66.4       .985            .015    .18
                                         82.2       .976            .024    .15
                                         32.4       .961            .039
                                         18.4                       .052    .07
                                         8.4        .936            .064    .03
                                         2.5        .932            .068    .01

TABLE 2

Utility of Test for Classifying Persons as Above or Below y = +1 s_y,
with Maximum Risk .10

             Identification of                         Identification of   Total
             Superior Cases                            Nonsuperior Cases   Decisions
Validity,    Upper   Decisions  Correct    Proportion  Lower   Proportion  Number   Correct    Proportion
Index of     Cutoff  Reached    Decisions  Correct     Cutoff  Correct     Reached  Decisions  Correct
Reliability
1.00         1.00    15.9       15.9       1.00        1.00    1.00        100.     100.       1.00
.95          1.47    7.1        6.9        .972        .63     .992        80.7     79.9       .990
.90          1.73    4.2        4.0        .952                .987        73.0     71.9       .985
.80          2.21    1.4        1.3                    .29     .977        62.8     61.3       .976
.70          2.73    .3         .25        .833                .965        55.5     53.6       .965
.60          3.37                                      −.03    .959        48.8     46.8       .959
.50                                                            .947        41.3     39.1       .947
.40                                                    −.43    .940        33.4     31.4       .940
.30                                                    −.73                23.3     21.5       .923
.20                                                    −1.25   .906        10.6                .906
.10                                                    −2.74               .3

Table 2 deals with the situation where
it is desired to discriminate persons with
y > +1 s_y from those below that point.
The upper cutoff is used in identifying
persons with a marked superiority. The
lower cutoff is used to identify persons for
whom y is thought to be less than y’. (The
same values, with a change in sign of x’
and x”, apply when the test is used to
identify persons with a marked weakness.)

Our tables provide a corrective to the 
optimism of Lord’s table. According to 
Table 1, a score with validity .80 or relia- 
bility .64 permits classifying one-third of 
the subjects as above or below average on 
the criterion (or true score) with a maxi- 
mum risk of .10. Likewise, a difference 
score of reliability .64 permits us to report 
one-third of the subjects as having definite 
positive or negative differences with, at 
most, 1 chance in 10 of being incorrect. If 
we set the tolerable risk φ_m at 1 in 20,
reliability must be about .85 to permit an 
equal number of decisions. According to 
Table 2, the test of low reliability or valid- 
ity permits a somewhat greater number of 
decisions as y’ moves away from the mean, 
but this gain is mostly in singling out non- 
deviates, i.e., persons for whom y is not 
more extreme than y’. A test with reliabil- 
ity .64 identifies, at the desired level of 
confidence, less than one-tenth of the
superior persons for whom y > +1 s_y.
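
The statement that fewer than one-tenth of the superior persons are identified can be checked by numerical integration. This is a rough sketch under stated assumptions: x and y are taken to be standard bivariate normal with correlation equal to the validity, and the joint probability is approximated with a simple midpoint rule. The helper names, the step count and the upper integration limit are illustrative choices and not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()  # standard normal; x and y are assumed to be standard scores

def upper_cutoff(validity, y_crit, phi_m=0.10):
    """Upper cutting score: at x = cutoff, P(y < y_crit) equals phi_m."""
    m = ND.inv_cdf(phi_m)                  # negative deviate, lower-tail area phi_m
    return (y_crit - m * sqrt(1.0 - validity ** 2)) / validity

def joint_above(validity, x_cut, y_crit, steps=4000, x_max=8.0):
    """Approximate P(x > x_cut and y > y_crit) by a midpoint rule over x."""
    spread = sqrt(1.0 - validity ** 2)
    width = (x_max - x_cut) / steps
    total = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * width
        p_y_above = 1.0 - ND.cdf((y_crit - validity * x) / spread)
        total += ND.pdf(x) * p_y_above * width
    return total

validity, y_crit = 0.80, 1.0               # reliability .64, criterion at +1 SD
cut = upper_cutoff(validity, y_crit)
reached = 1.0 - ND.cdf(cut)                # persons flagged as superior
correct = joint_above(validity, cut, y_crit)
superior = 1.0 - ND.cdf(y_crit)            # all persons with y above +1 SD
print(round(cut, 2), round(100 * reached, 1), round(correct / superior, 3))
```

The last printed value is the share of truly superior persons who are both flagged and correctly classified.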

It somewhat oversimplifies the problem 
to treat all errors of classification as equally 
serious. In a specific situation, the most 
satisfactory analysis of the usefulness of a 
test and decision-making strategy would 
usually be obtained by specifying for each 
y the exact benefit or loss from each pos- 
sible decision (Cronbach & Gleser, 1957, 
pp. 44-46). A much simpler formulation, 
however, will often be appropriate. Suppose
that in evaluating certain difference
scores for counseling purposes, it is
recognized that large differences are much more
important to detect than small ones. The
level y’ may be specified so as to distinguish
between true differences regarded as
important and those regarded as trivial. Using
Equation [2], cutting scores may be
determined so as to permit the judgment y > y’
or y < −y’ with an acceptable risk of
error. Two symmetric cutting scores are
determined: a lower score x’ such that for
x < x’, P(y > −y’) ≤ φ_m, and an upper
score, x”, such that for x > x”, P(y < y’) ≤
φ_m. Interpretations or decisions are made
for persons for whom x < x’ or x > x”.
If x’ < x < x”, no decision is made. The
operation of this strategy may be illustrated
by reference to Table 2, if we suppose that
true differences less than 1 SD in absolute
value are considered negligible. Then the
number of persons confidently (φ_m ≤ .10)
identified as having a large difference in one
direction or the other is obtained by
doubling the entries in the p_a + p_b column,
and the average accuracy is given by p_c
(without doubling). Obviously, a test will
identify very few persons as having
differences greater than 1 SD if its reliability is
below .80 (index of reliability below .90).


Discussion 


We have identified three risks which 
may be taken into account in fixing strate- 
gies for test interpretation and for evaluat- 
ing the usefulness of a test interpreted by 
a particular strategy. It is assumed that 
the persons are to be divided into three 
classes: those whose true scores (or true 
differences between scores) are believed to 
be greater than a specified criterion score 
y’; those believed to have true scores less 
than y’; and persons for whom neither in- 
terpretation may safely be made. 

The risk α is the risk of interpreting a
score as indicating y > y’ or y < y’ when
y actually equals y’. The risk q_c is the
average risk, over all decisions made, of
concluding that y > y’ when y actually is
less than y’, and vice versa. The risk φ’ or
φ” is this same risk of misinterpretation
for persons at the score where the risk of
misinterpretation is greatest. The CTD
and Mendenhall suggest a strategy which
fixes the α risk at a predetermined value.
Fixing the α risk is a logically defensible
method of establishing a strategy. It is not
appropriate, however, to describe the
utility of decisions actually made in terms of
the α risk, as Mendenhall does. The
answer to this question is contained in the q_c
and φ risks. The former is more important
in institutional decisions (e.g., selection,
placement) and the latter in individual
decisions (e.g., counseling). A strategy
designed to fix either q_c or φ_m, depending on
the type of decision, is logically to be
preferred over a fixed α strategy. The fixed
α strategy, as Lord shows, is unduly
conservative when applied to highly reliable
scores. In addition, it results in a high rate
of error for scores of very low reliability
and validity.

We agree with Lord and Bloom that 
there is no arbitrary level of validity or of 
reliability which makes a score useful. The 
suitability of a test depends upon these 
coefficients, but it also depends upon the 
importance of the decisions to be made and 
on the rules by which scores are to be con- 
verted into interpretations. Where a test is 
used, for example, to identify those persons 
who are clearly above or below the mean, 
a test of reliability .49 permits confident 
decisions (φ_m = .10, q_c = .05, α = .07)
regarding 19% of those tested. This may 
or may not be a profitable information 
yield, depending on the situation, cost of 
testing, etc. Where the primary aim is to 
select superior individuals (more than 1 SD 
above the mean), the test reliability must 
reach .86 before as many as one-third of 
the superior individuals are identified with
φ_m = .10, q_c = .035. On the other hand, a
test of reliability .36 is capable of identify- 
ing more than half of those who are defi- 
nitely not superior; as the first stage in a 
sequential screening process, this test can 
rule out persons who need be given no fur- 
ther consideration. 
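
The three risks quoted for a test of reliability .49 can be recomputed together. This is a sketch only, with every name in it hypothetical. It assumes standard bivariate normal scores, a criterion at the mean (y’ = 0), and a midpoint-rule approximation for the average risk. By symmetry, the average risk is computed from the upper tail alone.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()  # standard normal distribution

def risks_above_below_mean(validity, phi_m=0.10, steps=4000, x_max=8.0):
    """Return (percent decided, average risk, alpha) for y' = 0."""
    spread = sqrt(1.0 - validity ** 2)
    cut = ND.inv_cdf(1.0 - phi_m) * spread / validity   # upper cutoff; lower is -cut
    tail = 1.0 - ND.cdf(cut)                            # P(x > cut)
    width = (x_max - cut) / steps
    wrong = 0.0
    for i in range(steps):
        x = cut + (i + 0.5) * width
        # P(y < 0 | x) for a person classified as above average
        wrong += ND.pdf(x) * ND.cdf(-validity * x / spread) * width
    # When y = 0, x is normal with mean 0 and SD sqrt(1 - validity**2).
    alpha = 2.0 * (1.0 - ND.cdf(cut / spread))
    return 200.0 * tail, wrong / tail, alpha

percent, average_risk, alpha = risks_above_below_mean(sqrt(0.49))
print(round(percent), round(average_risk, 2), round(alpha, 2))
```

The same function can be run at other validities to see how the yield of decisions falls as reliability drops while the maximum risk stays fixed.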
The test designer and selector of tests
must abandon his quest for a rule of 
thumb, and instead interpret Tables 1 and 
2 (and similar tables for other decision 
problems and risk levels) in the light of his 
particular situation. For the typical coun- 
seling decision, it is our opinion that the 
maximum individual risk φ_m is the most
important consideration in determining the 
interpretability of scores. From this point 
of view, the difference scores for certain 
published batteries discussed by Lord and 
Mendenhall are somewhat less useful than 
their papers imply. 